Theory of Alignment Generators and Applications to Statistical Machine Translation


Hemanta K. Maji and Raghavendra Udupa U.
IBM India Research Laboratory, New Delhi
{hemantkm, uraghave}@in.ibm.com

Abstract

Viterbi Alignment and Decoding are two fundamental search problems in Statistical Machine Translation. Both problems are known to be NP-hard, and therefore it is unlikely that there exists an optimal polynomial time algorithm for either of these search problems. In this paper, we develop a mathematical framework for finding approximate solutions to the two problems by limiting the search to a subspace of the original search space. Towards this end, we introduce the notion of alignment generators and develop a theory of generators. We give polynomial time algorithms for both Viterbi Alignment and Decoding. Unlike previous approaches, which find the best solution in a polynomial-sized search space, our algorithms are able to search through exponentially large subspaces in polynomial time. We show that in practice our algorithms find substantially better solutions than the previously known algorithms for these problems.

1 Introduction

Statistical Machine Translation (SMT) is a data-driven approach to Machine Translation [Brown et al., 1993], [Al-Onaizan et al., 1999], [Berger et al., 1996]. Two of the fundamental problems in SMT are:

$\hat{a} = \arg\max_{a} \Pr(f, a \mid e)$   (Viterbi Alignment)

$(\hat{e}, \hat{a}) = \arg\max_{e,\,a} \Pr(f, a \mid e)\,\Pr(e)$   (Decoding)

Viterbi Alignment has many applications in Natural Language Processing [Wang and Waibel, 1998], [Marcu and Wong, 2002]. While there exist simple polynomial time algorithms for computing the Viterbi Alignment for IBM Models 1-2, only approximate algorithms are known for Models 3-5. Recently, [Udupa and Maji, 2005] showed that the computation of the Viterbi Alignment is NP-hard for IBM Models 3-5. A decoding algorithm is an essential component of all SMT systems [Wang and Waibel, 1997], [Tillman and Ney, 2000], [Och et al., 2001], [Germann et al., 2003], [Udupa et al., 2004]. Decoding is known to be NP-hard for IBM Models 1-5 when the language model is a bigram model [Knight, 1999].

Unless P = NP, it is unlikely that there exist polynomial time algorithms for either Viterbi Alignment or Decoding. Therefore, there is a lot of interest in finding good approximate solutions to these problems efficiently. Previous approaches to Viterbi Alignment have focused on finding a local maximum using local search [Brown et al., 1993], [Al-Onaizan et al., 1999]. The best known algorithm for Decoding employs a greedy local search to find a sub-optimal solution [Germann et al., 2003]. However, local search algorithms can examine only a polynomial number of points in the search space in polynomial time and are therefore very sensitive to the choice of the initial point from which the search is begun.

In this paper, we develop a mathematical framework for solving Viterbi Alignment and Decoding approximately, i.e., we focus on finding the optimal solution in a subspace of the search space. The key difference between our approach and previous approaches is the following: our algorithms construct an exponentially large subspace $\mathcal{A}$ and find the optimal solution in this subspace in polynomial time:

$\hat{a} = \arg\max_{a \in \mathcal{A}} \Pr(f, a \mid e)$   (Viterbi Alignment)

$(\hat{e}, \hat{a}) = \arg\max_{e,\ a \in \mathcal{A}} \Pr(f, a \mid e)\,\Pr(e)$   (Decoding)

We introduce the notion of an alignment generator (Section 2) and develop a theory of generators (Section 3.2). Intuitively, an alignment generator is a minimal alignment that can be used to construct a large number of alignments through simple restructuring operations. Associated with every alignment generator is an exponentially large family of alignments. We show that for every alignment generator it is possible to find the optimal solution for both Viterbi Alignment and Decoding in polynomial time, using Dynamic Programming, if the search space is restricted to the family of alignments associated with the generator (Sections 4 and 5). We discuss the results of our experiments in Section 6.

2 Preliminaries

IBM Models 1-5 assume that there is a hidden alignment between the source and target sentences.

Definition 1 (Alignment) An alignment is a function $a^{(m,l)}: \{1,\ldots,m\} \to \{0,1,\ldots,l\}$, where $\{1,\ldots,m\}$ is the set of source positions and $\{0,1,\ldots,l\}$ is the set of target positions. The target position $0$ is known as the null position. We denote $a^{(m,l)}(j)$ by $a_j$. The fertility of the target position $i$ in alignment $a^{(m,l)}$ is $\phi_i = \sum_{j=1}^{m} \delta(a_j, i)$. If $\phi_i = 0$, then the target position $i$ is an infertile position; otherwise it is a fertile position. Let $\{k,\ldots,k+\gamma-1\} \subseteq \{1,\ldots,l\}$ be the longest range of consecutive infertile positions, where $\gamma \geq 0$. We say that the alignment $a^{(m,l)}$ has an infertility width of $\gamma$.
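As a small concrete illustration of Definition 1, the sketch below computes the fertilities and the infertility width of an alignment given as the list $(a_1, \ldots, a_m)$. The encoding and the function names are ours, not the paper's.

```python
from typing import List

def fertilities(a: List[int], l: int) -> List[int]:
    """phi[i] = number of source positions j with a_j = i, for i = 0..l."""
    phi = [0] * (l + 1)
    for a_j in a:
        phi[a_j] += 1
    return phi

def infertility_width(a: List[int], l: int) -> int:
    """Length of the longest run of consecutive infertile positions in 1..l."""
    phi = fertilities(a, l)
    width = run = 0
    for i in range(1, l + 1):
        run = run + 1 if phi[i] == 0 else 0
        width = max(width, run)
    return width

# Example: m = 4 source positions, l = 3 target positions.
# a_1 = 1, a_2 = 0 (null), a_3 = 3, a_4 = 3  ->  phi = [1, 1, 0, 2], width = 1.
print(fertilities([1, 0, 3, 3], 3))        # [1, 1, 0, 2]
print(infertility_width([1, 0, 3, 3], 3))  # 1
```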

Representation of Alignments: Since all source positions that are not connected to any of the target positions $1,\ldots,l$ are assumed to be connected to the null position $0$, we can represent an alignment, without loss of generality, as the set of tuples $\{(j, a_j) \mid a_j \neq 0\}$. Let $m' = m - \phi_0$. A sequence representation of an alignment thus has $m'$ tuples. By sorting the tuples in ascending order using the second field as the key, $a^{(m,l)}$ can be represented as a sequence $(j_1, a_{j_1}) \ldots (j_{m'}, a_{j_{m'}})$, where the $a_{j_k}$ are non-decreasing with increasing $k$. Note that there are $(l+1)^m$ possible alignments for given $m, l > 0$. Let $a^{(m,l)}_{(v,u)}$ be the alignment $\{1,\ldots,v\} \to \{0,\ldots,u\}$ obtained by removing all tuples $(j, a_j)$ such that $j > v$ or $a_j > u$.

Definition 2 (Generator) A generator is a bijection $g^{(m)}: \{1,\ldots,m\} \to \{1,\ldots,m\}$.

Representation of Generators: A generator $g^{(m)}$ can be represented as a sequence of tuples $(j_1, 1) \ldots (j_m, m)$. Note that there are $m!$ different generators for a given $m > 0$. The identity function $g^{(m)}(j) = j$ is denoted by $g_0^{(m)}$.

Definition 3 ($g^{(m)} \rightarrow a^{(m,l)}$) An alignment $a^{(m,l)}$ with tuple sequence $(j_1, a_{j_1}) \ldots (j_{m'}, a_{j_{m'}})$ is said to be generated by a generator $g^{(m)}$ with tuple sequence $(k_1, 1) \ldots (k_m, m)$ if $j_1 \ldots j_{m'}$ is a subsequence of $k_1 \ldots k_m$.

Note that the above definition does not say anything about the infertile positions. More specifically, it does not take into account the infertility width of the alignment. Therefore, we refine the notion of alignment generation.

Definition 4 ($g^{(m)} \rightarrow_\gamma a^{(m,l)}$) An alignment $a^{(m,l)}$ is said to be $\gamma$-generated by $g^{(m)}$ if $g^{(m)} \rightarrow a^{(m,l)}$ and the infertility width of $a^{(m,l)}$ is at most $\gamma$.

Definition 5 ($\mathcal{A}^{g^{(m)}}_{\gamma,l}$) $\mathcal{A}^{g^{(m)}}_{\gamma,l} = \{ a^{(m,l)} \mid g^{(m)} \rightarrow_\gamma a^{(m,l)} \}$.

Thus, for every generator $g^{(m)}$ and integer $\gamma \geq 0$ there is an associated family of alignments $\mathcal{A}^{g^{(m)}}_{\gamma,l}$ for a given $l \in \mathbb{N}$. Further, $\mathcal{A}^{g^{(m)}}_{\gamma,*} = \bigcup_{l} \mathcal{A}^{g^{(m)}}_{\gamma,l}$.

Swap Operation: Let $g^{(m)}$ be a generator and $(j_1,1) \ldots (j_k,k) \ldots (j_{k'},k') \ldots (j_m,m)$ be its tuple sequence. The result of a SWAP operation on the $k$th and $k'$th tuples in the sequence is another tuple sequence $(j_1,1) \ldots (j_{k'},k) \ldots (j_k,k') \ldots (j_m,m)$. Note that the resulting tuple sequence also defines a generator $g'^{(m)}$. The relationship between generators is stated by the following lemma:

Lemma 1 If $g^{(m)}$ and $g'^{(m)}$ are two generators, then $g'^{(m)}$ can be obtained from $g^{(m)}$ by a series of SWAP operations.

Proof: Follows from the properties of permutations.

We can apply a sequence of SWAP operations to modify any given generator $g^{(m)}$ to $g_0^{(m)}$. The source position $r_v$ ($1 \leq v \leq m$) in $g^{(m)}$ corresponds to the source position $v$ in $g_0^{(m)}$. For brevity, we denote a sequence $v_i v_{i+1} \ldots v_j$ by $v_i^j$. The source sentence is represented by $f_1^m$ and the target sentence by $e_1^l$.

3 Framework

We now describe the common framework for solving Viterbi Alignment and Decoding. The key idea is to construct an exponentially large subspace $\mathcal{A}$ corresponding to a generator $g^{(m)}$ and find the best solution in that subspace. Our framework is as follows:

1. Choose any generator $g^{(m)}$ as the initial generator.
2. Find the best solution for the problem in $\mathcal{A}$.
3. Pick a better generator $g'^{(m)}$ by applying a SWAP operation on $g^{(m)}$.
4. Repeat steps 2 and 3 until the solution cannot be improved further.
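Before moving to the algorithms, here is a small sketch that makes Definitions 2-5 and the SWAP operation concrete. The identifiers are ours; in particular, the tie-breaking rule used when several source positions share a target position (we order them by their positions in the generator) is our reading of Definition 3, not something the paper states explicitly.

```python
from typing import List

def infertility_width(a: List[int], l: int) -> int:
    """Longest run of consecutive infertile target positions in 1..l."""
    phi = [0] * (l + 1)
    for a_j in a:
        phi[a_j] += 1
    width = run = 0
    for i in range(1, l + 1):
        run = run + 1 if phi[i] == 0 else 0
        width = max(width, run)
    return width

def swap(g: List[int], k1: int, k2: int) -> List[int]:
    """SWAP the k1-th and k2-th tuples (0-based) of the generator's tuple sequence."""
    g2 = list(g)
    g2[k1], g2[k2] = g2[k2], g2[k1]
    return g2

def is_subsequence(sub: List[int], seq: List[int]) -> bool:
    it = iter(seq)
    return all(x in it for x in sub)

def gamma_generates(g: List[int], a: List[int], l: int, gamma: int) -> bool:
    """True iff the generator g gamma-generates the alignment a (Definitions 3 and 4)."""
    order = {j: pos for pos, j in enumerate(g, start=1)}
    # Tuple sequence of a: non-null tuples sorted by target position,
    # ties broken by the generator's order (our assumption).
    tuples = sorted(((j, a[j - 1]) for j in range(1, len(a) + 1) if a[j - 1] != 0),
                    key=lambda t: (t[1], order[t[0]]))
    sources = [j for j, _ in tuples]
    return is_subsequence(sources, g) and infertility_width(a, l) <= gamma

g0 = [1, 2, 3, 4]                                      # identity generator g0^(4)
g1 = swap(g0, 1, 2)                                    # [1, 3, 2, 4], another generator
print(gamma_generates(g0, [1, 0, 3, 3], 3, gamma=1))   # True: sources (1, 3, 4) follow g0's order
print(gamma_generates(g0, [3, 0, 1, 3], 3, gamma=1))   # False: target order forces sources (3, 1, 4)
```

Swapping two tuples of $g_0^{(4)}$, as in g1 above, yields another generator; this is the move used in step 3 of the framework.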
3.1 Algorithms

Viterbi Alignment: The algorithm for finding a good approximate solution to Viterbi Alignment employs a subroutine Viterbi_For_Generator to find the best solution in the exponentially large subspace defined by $\mathcal{A}^{g^{(m)}}_{\gamma,l}$. The implementation of this subroutine is discussed in Section 4.

Algorithm 1 Viterbi Alignment
1: $g^{(m)} \leftarrow g_0^{(m)}$
2: while true do
3:   $a^{(m,l)} \leftarrow$ Viterbi_For_Generator($g^{(m)}, f_1^m, e_1^l$)
4:   Find an alignment $a'^{(m,l)}$ with a higher score by using SWAP operations on $a^{(m,l)}$
5:   if $a'^{(m,l)}$ = NIL then
6:     break
7:   end if
8:   $g^{(m)} \leftarrow$ a generator of $a'^{(m,l)}$
9: end while
10: Output $a^{(m,l)}$

Decoding: Our algorithm for Decoding employs a subroutine Decode_For_Generator to find the best solution in the exponentially large subspace defined by $\mathcal{A}^{g^{(m)}}_{\gamma,*}$. The implementation of this subroutine is discussed in Section 5.

Algorithm 2 Decoding
1: $g^{(m)} \leftarrow g_0^{(m)}$
2: while true do
3:   $(e_1^l, a^{(m,l)}) \leftarrow$ Decode_For_Generator($g^{(m)}, f_1^m$)
4:   Find an alignment $a'^{(m,l)}$ with a higher score by using SWAP operations on $a^{(m,l)}$ and $e_1^l$
5:   if $a'^{(m,l)}$ = NIL then
6:     break
7:   end if
8:   $g^{(m)} \leftarrow$ a generator of $a'^{(m,l)}$
9: end while
10: Output $e_1^l$
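The outer loop shared by Algorithms 1 and 2 can be written as a short skeleton. In the sketch below, viterbi_for_generator, improve_by_swaps, and generator_of are placeholders for the subroutines described above and in Section 4; they are not implementations.

```python
from typing import Callable, List

def viterbi_alignment(f: List[str], e: List[str],
                      viterbi_for_generator: Callable,
                      improve_by_swaps: Callable,
                      generator_of: Callable):
    """Skeleton of Algorithm 1: alternate an exact search inside the family of the
    current generator with a swap-based move to a better generator, and stop when
    no improvement is found."""
    m = len(f)
    g = list(range(1, m + 1))                 # start from the identity generator g0
    while True:
        a = viterbi_for_generator(g, f, e)    # best alignment in A^{g}_{gamma,l}
        a_better = improve_by_swaps(a, f, e)  # look for a higher-scoring alignment
        if a_better is None:                  # NIL: keep the current solution
            return a
        g = generator_of(a_better)            # any generator that gamma-generates it
```

Algorithm 2 follows the same skeleton, with Decode_For_Generator returning a pair $(e_1^l, a^{(m,l)})$ and the swap-based improvement step also allowed to modify $e_1^l$.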

3.2 Theory of Generators

In this section, we prove some important results on generators. We begin with a few useful observations regarding $\mathcal{A}^{g^{(m)}}_{\gamma,l}$:

1. If $\gamma < \gamma'$, then $\mathcal{A}^{g^{(m)}}_{\gamma,l} \subseteq \mathcal{A}^{g^{(m)}}_{\gamma',l}$ and $\mathcal{A}^{g^{(m)}}_{\gamma,*} \subseteq \mathcal{A}^{g^{(m)}}_{\gamma',*}$.
2. For any alignment $a^{(m,l)}$ with infertility width $\gamma$, there exists $g^{(m)}$ such that $a^{(m,l)} \in \mathcal{A}^{g^{(m)}}_{\gamma,l}$.
3. For any alignment $a^{(m,l)}$, there exists $g^{(m)}$ such that $a^{(m,l)} \in \mathcal{A}^{g^{(m)}}_{\gamma=l,\,l}$.

Structure Transformation Operations: We define a set of structure transformation operations on generators. The application of each of these operations on a generator modifies the structure of the generator and produces an alignment. Without loss of generality (any generator can be permuted to the identity generator $g_0^{(m)}$), we assume that the generator is the identity generator $g_0^{(m)}$ and define the structure transformation operations on it. The tuple sequence for $g_0^{(m)}$ is $(1,1) \ldots (m,m)$. Given an alignment $a^{(j-1,i)} \in \mathcal{A}^{g_0^{(j-1)}}_{\gamma,i}$ with $\phi_i > 0$, we explain the modifications introduced by each of these operators to the $j$th tuple:

1. SHRINK: Extend $a^{(j-1,i)} \in \mathcal{A}^{g_0^{(j-1)}}_{\gamma,i}$ to $a^{(j,i)} \in \mathcal{A}^{g_0^{(j)}}_{\gamma,i}$ such that $a_j = 0$.
2. MERGE: Extend $a^{(j-1,i)} \in \mathcal{A}^{g_0^{(j-1)}}_{\gamma,i}$ to $a^{(j,i)} \in \mathcal{A}^{g_0^{(j)}}_{\gamma,i}$ such that $a_j = i$.
3. $(\gamma,k)$-GROW ($0 \leq k \leq \gamma$): Extend $a^{(j-1,i)} \in \mathcal{A}^{g_0^{(j-1)}}_{\gamma,i}$ to $a^{(j,i+k+1)} \in \mathcal{A}^{g_0^{(j)}}_{\gamma,i+k+1}$ such that $a_j = i+k+1$.

By removing the infertile positions at the end, we obtain the following result:

Lemma 2 $a^{(m,l)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l}$ and has at least one fertile position iff there exists $s$ such that $a^{(m,l)}_{(m,l-s)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l-s}$, where $0 \leq s \leq \gamma$ and $\phi_{l-s} > 0$.

Theorem 1 If $a^{(m,l-s)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l-s}$ is such that $\phi_{l-s} > 0$, then it can be obtained from $g_0^{(m)}$ by a series of SHRINK, MERGE and $(\gamma,k)$-GROW operations.

Proof: Without loss of generality, we assume that the generator is $g_0^{(m)}$. We provide a construction that has $m$ phases. In the $v$th phase, we construct $a^{(m,l-s)}_{(v,u)}$ from $g_0^{(v)}$ in the following manner:

- If $a_v = 0$, we construct $a^{(m,l-s)}_{(v-1,u)}$ from $g_0^{(v-1)}$ and employ a SHRINK operation on the tuple $(v,v)$.
- If $\phi_u > 1$, we construct $a^{(m,l-s)}_{(v-1,u)}$ from $g_0^{(v-1)}$ and employ a MERGE operation on the tuple $(v,v)$.
- If $\phi_u = 1$: let $a^{(m,l-s)}_{(v,u)}$ define $\phi_t = 0$ for $u-k \leq t \leq u-1$ and $\phi_{u-k-1} > 0$, where $0 \leq k \leq \gamma$. We construct $a^{(m,l-s)}_{(v-1,u-k-1)}$ from $g_0^{(v-1)}$ and apply $(\gamma,k)$-GROW to the $(v,v)$ tuple. (Note: if $u-k-1 = 0$, we consider the alignment in which $\{1,\ldots,v-1\}$ are all aligned to the null position.)

As these are the only possibilities in the $v$th phase, at the end of the $m$th phase we get $a^{(m,l-s)}$.

Lemma 2 and Theorem 1 guarantee the correctness of the dynamic programming algorithms for Viterbi Alignment and Decoding.

Lemma 3 $|\mathcal{A}^{g^{(m)}}_{\gamma,l}|$ is given by the coefficient of $x^l$ in

$\dfrac{\alpha_\gamma - 2}{x} \left[ \dfrac{(\alpha_\gamma - 2)(\alpha_\gamma^m - 1)}{\alpha_\gamma - 1} + 1 \right]$,   (1)

where $\alpha_\gamma = 2 + \sum_{k=0}^{\gamma} x^{k+1}$.

Proof: See Appendix A.

Theorem 2 $|\mathcal{A}^{g^{(m)}}_{\gamma,*}| = \dfrac{(\gamma+1)\left[(\gamma+1)(\gamma+3)^m + 1\right]}{\gamma+2}$.

Proof: Substitute $x = 1$ in Equation (1).

Theorem 3 For a fixed $l$, $|\mathcal{A}^{g^{(m)}}_{\gamma,l}| = \Omega(2^m)$.

Proof: See Appendix B.

Viterbi_For_Generator finds the optimal solution over all possible alignments in $\mathcal{A}^{g^{(m)}}_{\gamma,l}$ in polynomial time, and Decode_For_Generator finds the optimal solution over all possible alignments in $\mathcal{A}^{g^{(m)}}_{\gamma,*}$ in polynomial time.

Lemma 4 Let $S = \{g^{(m)}_1, \ldots, g^{(m)}_N\}$ be a set of generators. Define $\mathrm{span}(S) = \bigcup_{i=1}^{N} \mathcal{A}^{g^{(m)}_i}_{\gamma=l,\,l}$. Let $p$ be a polynomial. For $l \geq 2$, if $N = O(p(m))$, then there exists an alignment $a^{(m,l)}$ such that $a^{(m,l)} \notin \mathrm{span}(S)$.

Proof: See Appendix C.
This lemma shows that by considering any polynomially large set of generators, we cannot obtain the optimal solutions to either Viterbi Alignment or Decoding.
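Theorem 2 is easy to sanity-check by brute force: building alignments with SHRINK, MERGE, and $(\gamma,k)$-GROW over the identity generator, and then appending up to $\gamma$ trailing infertile positions as allowed by Lemma 2, enumerates the whole family, and the resulting count can be compared with the closed form. The sketch below is our own check, with identifiers of our choosing.

```python
from fractions import Fraction

def family_size(m: int, gamma: int) -> int:
    """|A^{g0(m)}_{gamma,*}|, enumerated via SHRINK / MERGE / (gamma,k)-GROW."""
    states = {((), 0)}                       # (alignment prefix, last fertile position u)
    for _ in range(m):
        nxt = set()
        for a, u in states:
            nxt.add((a + (0,), u))                       # SHRINK: a_v = 0
            if u > 0:
                nxt.add((a + (u,), u))                   # MERGE: a_v = u
            for k in range(gamma + 1):                   # (gamma,k)-GROW: a_v = u + k + 1
                nxt.add((a + (u + k + 1,), u + k + 1))
        states = nxt
    fertile = sum(1 for _, u in states if u > 0)
    # Each alignment with a fertile position yields gamma+1 family members
    # (append s = 0..gamma trailing infertile positions, Lemma 2); the all-null
    # alignment contributes one member for each l = 0..gamma.
    return fertile * (gamma + 1) + (gamma + 1)

def theorem_2(m: int, gamma: int) -> int:
    """(gamma+1) * [ (gamma+1)(gamma+3)^m + 1 ] / (gamma+2)."""
    return int(Fraction((gamma + 1) * ((gamma + 1) * (gamma + 3) ** m + 1), gamma + 2))

for m in range(1, 6):
    for gamma in range(3):
        assert family_size(m, gamma) == theorem_2(m, gamma)
print("Theorem 2 verified by enumeration for m <= 5, gamma <= 2")
```

For instance, both sides give 7032 for $m = 5$ and $\gamma = 2$, already far more alignments than the $5! = 120$ generators over five source positions.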

4 Viterbi Alignment

We now develop the algorithm for the subroutine Viterbi_For_Generator for IBM Model 3. Recall that this subroutine solves the following problem:

$\hat{a}^{(m,l)} = \arg\max_{a^{(m,l)} \in \mathcal{A}^{g^{(m)}}_{\gamma,l}} \Pr(f_1^m, a^{(m,l)} \mid e_1^l)$

Our algorithm has a time complexity polynomial in $m$ and $l$. Without loss of generality, we assume that $g^{(m)} = g_0^{(m)}$. Note that $a^{(m,l)}_{(m,l-s)}$, where $0 \leq s \leq \gamma$ and $\phi_{l-s} > 0$, can be obtained from $g_0^{(m)}$ using the structure transformation operations described in Section 3.2. Therefore, we build the Viterbi alignment from left to right: we scan the tuple sequence from left to right in $m$ phases, and in each phase we determine the best structure transformation operation for that phase. We consider all possible partial alignments between $f_1^v$ and $e_1^u$ such that $\phi_u = \varphi > 0$ and $\phi_0 = \varphi_0$. Let $B(u,v,\varphi_0,\varphi)$ be the best partial alignment between $f_1^v$ and $e_1^u$ and $A(u,v,\varphi_0,\varphi)$ be its score. (A naive implementation of the table $B(\cdot,\cdot,\cdot,\cdot)$ takes $O(m)$ space per entry; it can instead be implemented as a table of decisions together with a pointer to the alignment that was extended, which makes each entry $O(1)$ in size.) Here, $\varphi_0$ is the number of French words aligned to $e_0$ in the partial alignment. The key idea is to compute $A(u,v,\varphi_0,\varphi)$ and $B(u,v,\varphi_0,\varphi)$ recursively using Dynamic Programming. In the $v$th phase, corresponding to the tuple $(v,v)$, we consider each of the structure transformation operations:

1. SHRINK: If $\varphi_0 > 0$, the SHRINK operation extends $B(u,v-1,\varphi_0-1,\varphi)$, and the score of the resulting partial alignment is $s_1 = c\,A(u,v-1,\varphi_0-1,\varphi)$, where
$c = t(f_v \mid \mathrm{NULL}) \; \dfrac{p_1\,(m - 2\varphi_0 + 1)(m - 2\varphi_0 + 2)}{p_0^2\,(m - \varphi_0 + 1)\,\varphi_0}$

2. MERGE: If $\varphi > 0$, the MERGE operation extends $B(u,v-1,\varphi_0,\varphi-1)$, and the score of the resulting partial alignment is $s_2 = c\,A(u,v-1,\varphi_0,\varphi-1)$, where
$c = \dfrac{n(\varphi \mid e_u)\,\varphi}{n(\varphi-1 \mid e_u)} \; t(f_v \mid e_u)\; d(r_v \mid u, l, m)$

3. $(\gamma,k)$-GROW: Let $\varphi' = \arg\max_{\varphi} A(u-k-1, v-1, \varphi_0, \varphi)$. The $(\gamma,k)$-GROW operation extends $B(u-k-1, v-1, \varphi_0, \varphi')$, and the fertility of position $u$ in the resulting partial alignment is $1$. If $u-k-1 > 0$, the score of the resulting partial alignment is $s_{3,k} = c\,A(u-k-1, v-1, \varphi_0, \varphi')$, where
$c = n(1 \mid e_u)\; t(f_v \mid e_u)\; d(r_v \mid u, l, m) \prod_{p=u-k}^{u-1} n(0 \mid e_p)$
If $u-k-1 = 0$, then all of $f_1, \ldots, f_{v-1}$ are aligned to the null position (so $\varphi_0 = v-1$) and
$s_{3,k} = \dbinom{m-\varphi_0}{\varphi_0}\, p_0^{m-2\varphi_0}\, p_1^{\varphi_0} \prod_{p=1}^{v-1} t(f_p \mid e_0)\; n(1 \mid e_u)\; t(f_v \mid e_u)\; d(r_v \mid u, l, m) \prod_{p=u-k}^{u-1} n(0 \mid e_p)$

We choose the operation which gives the best score. As a result, $B(u,v,\varphi_0,\varphi)$ now represents the best partial alignment so far and $A(u,v,\varphi_0,\varphi)$ represents its score. Thus,

$A(u,v,\varphi_0,\varphi) = \max_{a^{(v,u)} \in \mathcal{A}^{g_0^{(v)}}_{\gamma,u},\; \phi_u = \varphi,\; \phi_0 = \varphi_0} \Pr(f_1^v, a^{(v,u)} \mid e_1^u)$

We determine $A(u,v,\varphi_0,\varphi)$ and $B(u,v,\varphi_0,\varphi)$ for all $0 \leq u \leq l$, $0 \leq v \leq m$, $0 < \varphi \leq \varphi_{\max}$, and $0 \leq \varphi_0 \leq m/2$. To complete the scores of the alignments, we multiply each $A(u,m,\varphi_0,\varphi)$ by $\prod_{p=u+1}^{l} n(0 \mid e_p)$ for $l-\gamma \leq u \leq l$. Let

$(\hat{u}, \hat{\varphi}_0, \hat{\varphi}) = \arg\max_{l-\gamma \leq u \leq l,\; 0 \leq \varphi_0 \leq m/2,\; 0 < \varphi \leq \varphi_{\max}} A(u, m, \varphi_0, \varphi)$

The Viterbi Alignment is, therefore, $B(\hat{u}, m, \hat{\varphi}_0, \hat{\varphi})$.

4.1 Complexity

Time Complexity: The algorithm computes $B(u,v,\varphi_0,\varphi)$ and $A(u,v,\varphi_0,\varphi)$, so there are $O(lm^2\varphi_{\max})$ entries that need to be computed. Each of these entries is computed incrementally by a structure transformation operation: a SHRINK operation requires $O(1)$ time, a MERGE operation requires $O(1)$ time, and a $(\gamma,k)$-GROW operation ($0 \leq k \leq \gamma$) requires $O(\varphi_{\max})$ time for finding the best alignment plus another $O(1)$ time to update the score. So each entry takes $O(\gamma\varphi_{\max})$ time, and the computation of the tables $A$ and $B$ takes $O(lm^2\gamma\varphi_{\max}^2)$ time. The final step of finding the Viterbi alignment from the table entries takes $O(m\gamma\varphi_{\max})$ time. Therefore, the algorithm takes $O(lm^2\gamma\varphi_{\max}^2)$ time in total. In practice, $\gamma$ and $\varphi_{\max}$ are taken to be constants, so the time complexity of this algorithm is $O(lm^2)$.
Space Complexity: There are $O(lm^2\varphi_{\max})$ entries in each of the two tables. Assuming $\gamma$ and $\varphi_{\max}$ to be constants, the space complexity of Viterbi_For_Generator is $O(lm^2)$.

Two implementation notes: the product $\prod_{p=u-k}^{u-1} n(0 \mid e_p)$ can be computed incrementally for each $k$, so that each iteration requires only one multiplication; similarly, the score $\prod_{p=1}^{v-1} t(f_p \mid \mathrm{NULL})$ can be computed incrementally over the loop on $v$. Likewise, the factor $\prod_{p=u+1}^{l} n(0 \mid e_p)$ can be computed incrementally over the loop on $u$, so it takes $O(1)$ time in each iteration.

IBM Models 4-5: Extending this procedure to IBM Models 4-5 is straightforward, and we leave the modifications to the reader.
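The dynamic program above can be sketched compactly if the model is stripped down to its lexicon and fertility terms. The following is a simplified illustration, not the paper's implementation: it drops Model 3's distortion and null-generation terms (and hence the $\varphi_0$ bookkeeping), uses made-up toy parameters, and checks the result against a brute-force search over the same restricted family $\mathcal{A}^{g_0^{(m)}}_{\gamma,l}$.

```python
import itertools
import random
from collections import defaultdict

# Toy setup (made up): source sentence F, target sentence E, random parameters.
F = ["f1", "f2", "f3"]                 # f_1^m
E = ["e1", "e2"]                       # e_1^l (position 0 is NULL)
GAMMA, PHI_MAX = 1, 2
random.seed(0)
T = {(f, e): random.random() for f in F for e in E + ["NULL"]}            # t(f | e)
N = {(e, phi): random.random() for e in E for phi in range(PHI_MAX + 1)}  # n(phi | e)
t = lambda f, e: T[(f, e)]
n = lambda phi, e: N[(e, phi)]

def score(a, l):
    """Simplified Model-3-style score: lexicon and fertility terms only."""
    phi = [0] * (l + 1)
    for aj in a:
        phi[aj] += 1
    if any(p > PHI_MAX for p in phi[1:]):
        return 0.0
    s = 1.0
    for j, aj in enumerate(a):
        s *= t(F[j], E[aj - 1] if aj > 0 else "NULL")
    for i in range(1, l + 1):
        s *= n(phi[i], E[i - 1])
    return s

def brute_force(m, l):
    """Best score over alignments generated by the identity generator with width <= GAMMA."""
    best = 0.0
    for a in itertools.product(range(l + 1), repeat=m):
        nonnull = [x for x in a if x > 0]
        if nonnull != sorted(nonnull):                 # not generated by g0 (not monotone)
            continue
        run = width = 0
        for i in range(1, l + 1):
            run = run + 1 if i not in nonnull else 0
            width = max(width, run)
        if width <= GAMMA:
            best = max(best, score(a, l))
    return best

def viterbi_for_generator(m, l):
    """DP over states (u, phi): u is the last fertile target position, phi its fertility.
    The stored value excludes the fertility factor of the still-open position u."""
    V = {(0, 0): 1.0}
    for v in range(m):                                 # phase v+1 places f_{v+1}
        fv, NV = F[v], defaultdict(float)
        for (u, phi), val in V.items():
            NV[(u, phi)] = max(NV[(u, phi)], val * t(fv, "NULL"))              # SHRINK
            if u > 0 and phi < PHI_MAX:                                        # MERGE
                NV[(u, phi + 1)] = max(NV[(u, phi + 1)], val * t(fv, E[u - 1]))
            close = n(phi, E[u - 1]) if u > 0 else 1.0
            for k in range(GAMMA + 1):                                         # (GAMMA,k)-GROW
                u2 = u + k + 1
                if u2 > l:
                    break
                gap = 1.0
                for p in range(u + 1, u2):             # the k skipped positions stay infertile
                    gap *= n(0, E[p - 1])
                NV[(u2, 1)] = max(NV[(u2, 1)], val * close * gap * t(fv, E[u2 - 1]))
        V = NV
    best = 0.0
    for (u, phi), val in V.items():
        if u > 0 and l - u <= GAMMA:                   # close u and append trailing infertiles
            tail = n(phi, E[u - 1])
            for p in range(u + 1, l + 1):
                tail *= n(0, E[p - 1])
            best = max(best, val * tail)
        elif u == 0 and l <= GAMMA:                    # the all-null alignment
            tail = 1.0
            for p in range(1, l + 1):
                tail *= n(0, E[p - 1])
            best = max(best, val * tail)
    return best

assert abs(viterbi_for_generator(len(F), len(E)) - brute_force(len(F), len(E))) < 1e-12
print("restricted Viterbi score:", viterbi_for_generator(len(F), len(E)))
```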

5 Decoding

The algorithm for the subroutine Decode_For_Generator is similar in spirit to the algorithm for Viterbi_For_Generator. We provide the details for IBM Model 4, as it is the most popular choice for Decoding. We assume that the language model is a trigram model. $V_E$ is the target language vocabulary and $I_E \subseteq V_E$ is the set of infertile words. Our algorithm employs Dynamic Programming and computes the following:

$(\hat{e}_1^l, \hat{a}^{(m,l)}) = \arg\max_{e_1^l,\; a^{(m,l)} \in \mathcal{A}^{g^{(m)}}_{\gamma,*}} \Pr(f_1^m, a^{(m,l)} \mid e_1^l)\,\Pr(e_1^l)$

Let $H_{v-1}(e', e, \varphi_0, \varphi, \rho)$ be the set of all partial hypotheses which are translations of $f_1^{v-1}$ and have $e', e$ as their last two words; $\rho$ is the center of the cept associated with $e$, and $\varphi > 0$ is the fertility of $e$. Observe that the scores of all partial hypotheses in $H_{v-1}(e', e, \varphi_0, \varphi, \rho)$ are incremented by the same amount thereafter if the partial hypotheses are extended with the same sequence of operations. Therefore, for every $H_{v-1}(e', e, \varphi_0, \varphi, \rho)$ it suffices to work only with the best partial hypothesis in it.

The initial partial hypothesis is $h_0(\ast, \ast, 0, 0, 0)$ with a score of $p_0^m$, and initially the hypothesis set $H$ contains only this hypothesis. We scan the tuple sequence of $g_0^{(m)}$ left to right in $m$ phases and build the translation left to right. In the $v$th phase, we extend each partial hypothesis in $H$ by employing the structure transformation operations. Let $h_{v-1}(e', e, \varphi_0, \varphi, \rho)$ be the partial hypothesis being extended.

SHRINK: We create a new partial hypothesis $h_v(e', e, \varphi_0+1, \varphi, \rho)$ whose score is the score of $h_{v-1}(e', e, \varphi_0, \varphi, \rho)$ times
$t(f_v \mid \mathrm{NULL}) \; \dfrac{p_1\,(m - 2\varphi_0 - 1)(m - 2\varphi_0)}{p_0^2\,(m - \varphi_0)(\varphi_0 + 1)}$
We put the new hypothesis into $H_v(e', e, \varphi_0+1, \varphi, \rho)$.

MERGE: We create a new partial hypothesis $h_v(e', e, \varphi_0, \varphi+1, \rho')$ whose score is the score of $h_{v-1}(e', e, \varphi_0, \varphi, \rho)$ times
$\dfrac{n(\varphi+1 \mid e)\,(\varphi+1)}{n(\varphi \mid e)} \; t(f_v \mid e)\; \dfrac{d_{\mathrm{new}}}{d_{\mathrm{old}}}$
Here, $d_{\mathrm{old}}$ is the distortion probability score of the last tableau before expansion, $d_{\mathrm{new}}$ is the distortion probability of the expanded hypothesis after $r_v$ is inserted into the tableau, and $\rho'$ is the new center of the cept associated with $e$. We put the new hypothesis into $H_v(e', e, \varphi_0, \varphi+1, \rho')$.

$(\gamma,k)$-GROW: We choose $k$ words $e''_1, \ldots, e''_k$ from $I_E$ and one word $e''_{k+1}$ from $V_E$. We create a new partial hypothesis $h_v(e, e''_1, \varphi_0, 1, r_v)$ if $k = 0$, and $h_v(e''_k, e''_{k+1}, \varphi_0, 1, r_v)$ otherwise. The score of this hypothesis is the score of $h_{v-1}(e', e, \varphi_0, \varphi, \rho)$ times
$n(1 \mid e''_{k+1})\; t(f_v \mid e''_{k+1})\; d(r_v - \rho \mid \mathcal{A}(e), \mathcal{B}(f_v)) \prod_{p=1}^{k} n(0 \mid e''_p)$
We put the new hypothesis into $H_v(e''_k, e''_{k+1}, \varphi_0, 1, r_v)$ (into $H_v(e, e''_1, \varphi_0, 1, r_v)$ when $k = 0$).

At the end of the phase, we retain only the best hypothesis for each $H_v(e', e, \varphi_0, \varphi, \rho)$. After $m$ phases, we append at most $\gamma$ infertile words to all hypotheses in $H$. The hypothesis with the best score in $H$ is the output of the algorithm.

5.1 Complexity

Time Complexity: At the beginning of the $v$th phase, there are at most $O(|V_E|^2 v^2 \varphi_{\max})$ distinct partial hypotheses. It takes $O(1) + O(\varphi_{\max}) + O(|I_E|^\gamma |V_E| \varphi_{\max})$ time to extend a hypothesis. Therefore, the algorithm takes $O(|V_E|^3 |I_E|^\gamma m^3 \varphi_{\max}^2)$ time in total. Since $|V_E|$, $|I_E|$, $\varphi_{\max}$ and $\gamma$ are assumed to be constants, the time complexity of the algorithm is $O(m^3)$.

Space Complexity: In each phase, we need $O(|V_E|^2 m^2 \varphi_{\max})$ space. Assuming $|V_E|$ and $\varphi_{\max}$ to be constants, the space complexity is $O(m^2)$.
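The hypothesis recombination that makes this search tractable is worth isolating: hypotheses that agree on the state $(e', e, \varphi_0, \varphi, \rho)$ receive identical score increments from any common sequence of future extensions, so only the best one per state needs to be kept. The sketch below is our own illustration of that bookkeeping (the names are ours), not an implementation of the full Model 4 scorer.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# State of a partial hypothesis after a phase: the last two target words, the number
# of source words aligned to NULL, the fertility of the last word, and the center of
# the cept associated with the last word.
State = Tuple[str, str, int, int, int]

@dataclass
class Hypothesis:
    words: tuple       # target words produced so far
    alignment: tuple   # a_1 .. a_v built so far
    score: float

def recombine(extensions) -> Dict[State, Hypothesis]:
    """Keep only the best-scoring hypothesis per state."""
    table: Dict[State, Hypothesis] = {}
    for state, hyp in extensions:
        if state not in table or hyp.score > table[state].score:
            table[state] = hyp
    return table

# Usage: in each phase, generate (state, hypothesis) pairs with the SHRINK, MERGE,
# and (gamma,k)-GROW extensions described above, then call recombine() before the
# next phase; after m phases, append up to gamma infertile words and take the best.
```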
6 Experiments and Results

6.1 Decoding

We trained the models using the GIZA++ tool [Och, 2001] for the translation direction French to English. For the current set of experiments, we set $\gamma = 1$ and $\varphi_{\max} = 10$. To compare our results with those of a state-of-the-art decoder, we used the Greedy decoder [Marcu and Och, 2004]. Our test data consisted of a corpus of French sentences with sentence lengths in the ranges 06-10, 11-15, ..., 56-60. Each sentence-length class had 100 French sentences.

In Table 1, we compare the mean logscores (negative logarithm of the probability) for each of the length classes; a lower logscore indicates a better probability score. It can easily be seen that the scores for our algorithm are about 20% better than the scores for the Greedy algorithm. In the same table, we report the average time required for decoding each sentence class. Our algorithm is much faster than the Greedy algorithm for most length classes. This demonstrates the power of our algorithm to search an exponentially large subspace in polynomial time. We also compare the NIST and BLEU scores of our solutions with those of the Greedy solutions (Table 2). Our NIST scores are about 14% better, while our BLEU scores are about 16% better than those of the Greedy solutions.

6.2 Viterbi Alignment

We compare our results with those of the local search algorithm of GIZA++ [Och, 2001]. We set $\gamma = 4$ and $\varphi_{\max} = 10$. The initial generator for our algorithm is obtained from the solution of the local search algorithm. In Table 3, we report the mean logscores of the alignments for each length class found by our algorithm and by the local search algorithm. Our scores are about 5% better than those of the local search algorithm.

Table 1: Mean decoding time (sec) and mean logscores, per sentence-length class, for our algorithm and the Greedy decoder.

Table 2: BLEU and NIST scores, per sentence-length class, for our algorithm and the Greedy decoder.

Table 3: Viterbi Alignment logscores, per sentence-length class, for our algorithm and the GIZA++ local search.

7 Conclusions

We have demonstrated the strength of the theory of alignment generators on two problems in SMT. As alignments are fundamental to many NLP applications, we believe that the theory developed in this paper will prove useful for those applications as well.

Acknowledgments

References

[Al-Onaizan et al., 1999] Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F. Och, D. Purdy, N. Smith, and D. Yarowsky. Statistical machine translation: Final report. JHU Workshop, 1999.

[Berger et al., 1996] A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, A. Kehler, and R. Mercer. Language translation apparatus and method using context-based translation models. United States Patent 5,510,981, 1996.

[Brown et al., 1993] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2), 1993.

[Germann et al., 2003] U. Germann, M. Jahr, D. Marcu, and K. Yamada. Fast decoding and optimal decoding for machine translation. Artificial Intelligence, 2003.

[Knight, 1999] Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4), 1999.

[Marcu and Och, 2004] D. Marcu and F. Och. Greedy decoder for statistical machine translation. isi.edu/licensed-sw/rewrite-decoder, 2004.

[Marcu and Wong, 2002] D. Marcu and W. Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, 2002.

[Och et al., 2001] F. Och, N. Ueffing, and H. Ney. An efficient A* search algorithm for statistical machine translation. In Proceedings of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation, Toulouse, France, 2001.

[Och, 2001] F. Och. GIZA++. informatik.rwth-aachen.de/colleagues/och/software/giza++.html, 2001.

[Tillman and Ney, 2000] C. Tillman and H. Ney. Word reordering and DP-based search in statistical machine translation. In Proceedings of the 18th COLING, Saarbrucken, Germany, 2000.

[Udupa and Maji, 2005] R. Udupa and H. Maji. On the computational complexity of statistical machine translation. 2005.

[Udupa et al., 2004] R. Udupa, T. Faruquie, and H. Maji. An algorithmic framework for the decoding problem in statistical machine translation. In Proceedings of COLING 2004, Geneva, Switzerland, 2004.

[Wang and Waibel, 1997] Y. Wang and A. Waibel. Decoding algorithm in statistical machine translation. In Proceedings of the 35th ACL, Madrid, Spain, 1997.

[Wang and Waibel, 1998] Y. Wang and A. Waibel. Modeling with structures in statistical machine translation. In Proceedings of the 36th ACL, Montreal, Canada, 1998.

A Proof of Lemma 3

Without loss of generality, we work with $g_0^{(m)}$. Let $A(u,v)$ be the number of alignments in $\mathcal{A}^{g_0^{(v)}}_{\gamma,u}$ such that $\phi_u > 0$. Then we can write the following recurrence relation:

$A(u,v) = 2A(u,v-1) + \sum_{k=0}^{\gamma} B(u-k-1, v-1)$

$B(u,v) = \begin{cases} A(u,v) & \text{if } u > 0 \\ 1 & \text{if } u = 0 \\ 0 & \text{otherwise} \end{cases}$

The base cases are $A(u,v) = 0$ if $u = 0$ or $v = 0$. To write the generating function for $A(u,v)$, we use the notation $A_v(x) = \sum_{u=0}^{\infty} A(u,v)\,x^u$. The recurrence relation can then be written as

$A_v(x) = \left(2 + \sum_{k=0}^{\gamma} x^{k+1}\right) A_{v-1}(x) + \sum_{k=0}^{\gamma} x^{k+1}$

We write $\alpha_\gamma = 2 + \sum_{k=0}^{\gamma} x^{k+1}$, and we know that $A_0(x) = 0$. Using these, we obtain the solution

$A_v(x) = \dfrac{(\alpha_\gamma - 2)(\alpha_\gamma^v - 1)}{\alpha_\gamma - 1}$

Now, $a^{(m,l)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l}$ and has at least one fertile position iff there exists $s$ such that $a^{(m,l)}_{(m,l-s)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l-s}$, where $0 \leq s \leq \gamma$ and $\phi_{l-s} > 0$. If $a^{(m,l)} \in \mathcal{A}^{g_0^{(m)}}_{\gamma,l}$ has no fertile position, then we need to separately account for the alignment in which all the source positions are aligned to the null position. It is easy to see that the counting involved is exclusive. So, $|\mathcal{A}^{g_0^{(m)}}_{\gamma,l}|$ is given by the expression

$\sum_{s=0}^{\gamma} \left[ A(l-s, m) + \delta(l-s, 0) \right]$

This is the coefficient of $x^l$ in

$\dfrac{\alpha_\gamma - 2}{x} \left[ \dfrac{(\alpha_\gamma - 2)(\alpha_\gamma^m - 1)}{\alpha_\gamma - 1} + 1 \right]$,

which is Equation (1).

B Proof of Theorem 3

The coefficient of $x^l$ in Equation (1) (see Appendix A) is greater than $A(l,m)$. From the recurrence relation in Appendix A it is clear that, for a fixed $l$, $A(l,m) \geq 2A(l,m-1)$. So, $A(l,m) = \Omega(2^m)$ for a fixed value of $l$. Therefore, the coefficient of $x^l$ in Equation (1) is $\Omega(2^m)$ for a fixed $l$.

C Proof of Lemma 4

For $l = 0$ and $l = 1$, it is easy to see that the generator $g_0^{(m)}$ suffices to generate all possible alignments (in fact, any generator suffices). For $l = 2$, we count the number of alignments in $\mathcal{A}^{g_0^{(m)}}_{\gamma=l,\,l}$, because all generators have generated sets of the same size when $m > 0$. Let $A(u,v)$ be the number of alignments in $\mathcal{A}^{g_0^{(v)}}_{\gamma=l,\,u}$ such that $\phi_u > 0$, and let $N(u,v) = |\mathcal{A}^{g_0^{(v)}}_{\gamma=l,\,u}|$. We know that $A(0,v) = 0$ and $N(0,v) = 1$. It is easy to see that $A(1,v) = 2^v - 1$ and $N(1,v) = 2^v$, and that $A(2,v) = 2A(2,v-1) + A(1,v-1) + 1$. So, $A(2,v) = v\,2^{v-1}$ and $N(2,v) = v\,2^{v-1} + 2^v$, i.e., $|\mathcal{A}^{g_0^{(m)}}_{\gamma=l,\,l}| = m\,2^{m-1} + 2^m$ for $l = 2$.

If $N \leq k\,p(m)$, then $|\mathrm{span}(S)| \leq k\,p(m)\,(m\,2^{m-1} + 2^m)$, where $k$ is a constant. The total number of possible alignments when $l = 2$ and $m > 0$ is $3^m$. Now, it is impossible to have $3^m \leq k\,p(m)\,(m\,2^{m-1} + 2^m)$ for any polynomial $p$ and all sufficiently large $m$. Hence the result follows for $l = 2$. For $l > 2$, consider the set of alignments where $\phi_u = 0$ for all $u > 2$; the result for $l = 2$ extends easily to these sets of alignments. Hence, for $l \geq 2$, if $N = O(p(m))$, where $p$ is a polynomial, then there exists an alignment $a^{(m,l)}$ such that $a^{(m,l)} \notin \mathrm{span}(S)$.
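The recurrence of Appendix A is easy to check numerically against a direct enumeration of $\mathcal{A}^{g_0^{(m)}}_{\gamma,l}$ for small values. The sketch below is our own check; it interprets generation by the identity generator as the requirement that the non-null alignment values be non-decreasing.

```python
from itertools import product

def count_by_recurrence(m: int, gamma: int, l: int) -> int:
    """|A^{g0(m)}_{gamma,l}| = sum_{s=0}^{gamma} [ A(l-s, m) + delta(l-s, 0) ],
    with A(u, v) computed from the recurrence of Appendix A."""
    A = [[0] * (m + 1) for _ in range(l + 1)]          # A[u][v]; base cases are zero
    for v in range(1, m + 1):
        for u in range(1, l + 1):
            total = 2 * A[u][v - 1]
            for k in range(gamma + 1):                 # the B(u-k-1, v-1) terms
                u2 = u - k - 1
                total += A[u2][v - 1] if u2 > 0 else (1 if u2 == 0 else 0)
            A[u][v] = total
    return sum((A[l - s][m] if l - s > 0 else 1)
               for s in range(gamma + 1) if l - s >= 0)

def count_by_enumeration(m: int, gamma: int, l: int) -> int:
    """Direct count: identity-generated (monotone) alignments with width <= gamma."""
    count = 0
    for a in product(range(l + 1), repeat=m):
        nonnull = [x for x in a if x > 0]
        if nonnull != sorted(nonnull):
            continue
        run = width = 0
        for i in range(1, l + 1):
            run = run + 1 if i not in nonnull else 0
            width = max(width, run)
        if width <= gamma:
            count += 1
    return count

for m in range(1, 5):
    for gamma in range(3):
        for l in range(5):
            assert count_by_recurrence(m, gamma, l) == count_by_enumeration(m, gamma, l)
print("Appendix A recurrence matches direct enumeration for small m, l, gamma")
```

Printing count_by_recurrence(m, gamma, l) for a fixed $l$ and increasing $m$ also makes the $\Omega(2^m)$ growth of Theorem 3 visible, since $A(u,v) \geq 2A(u,v-1)$ by construction.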
