Note on EM-training of IBM-model 1

INF58 Language Technological Applications, Fall

The slides on this subject (inf58 6.pdf), including the example, seem insufficient to give a good grasp of what is going on. Hence here are some supplementary notes with more details. Hopefully they make things clearer.

The main idea

There are two main items involved:
- Translation probabilities
- Word alignments

The translation probabilities are assigned to the bilingual lexicon: for a pair of words (e, f) in the lexicon, how probable is it that e gets translated as f? This is expressed by t(f|e). Beware, this is calculated from the whole corpus; we do not consider these probabilities for a single sentence.

A word alignment is assigned to a pair of sentences (e, f). (We are using boldface to indicate that e is a string (array) of words e_1, e_2, ..., e_k, etc.) When we have a parallel corpus where the sentences are sentence-aligned, which may be expressed by (e_1, f_1), (e_2, f_2), ..., (e_m, f_m), we consider the alignment of each sentence pair individually. Ideally, we are looking for the best alignment of each sentence. But as we do not know it, we will instead consider the probabilities of the various alignments of the sentence. For each sentence, the probabilities of the various alignments must add up to 1.

The EM training then goes as follows.

1. Initializing
a. We start with initializing t. When we don't have other information, we initialize t uniformly. That is, t(f|e) = 1/s, where s is the number of F-words in the lexicon.
b. For each sentence in the corpus, we estimate the probability distribution over the various alignments of the sentence. This is done on the basis of t, and should reflect t: for example, if t(f_i|e_k) is some factor larger than t(f_j|e_k), then alignments which align f_i to e_k should be that same factor more probable than those which align f_j to e_k. (Well, actually, in round 1 this is trivial, since all alignments are equally likely when we start with a uniform t.)

2. Next round
a. We count how many times a word e is translated as f, on the basis of the probability distributions for the sentences. This is a fractional count. Given a sentence pair (e_j, f_j): if e occurs in e_j and f occurs in f_j, we consider the alignments which align f to e. Given such an alignment a, we consider its probability P(a), and from this alignment we count that e is translated as f P(a) many times. That is, if P(a) is some value p, we add p to the count of how many times e is translated as f. After we have done this for all alignments of all the sentences, we can recalculate t.

The notation in Koehn's book for the different counts and measures is not stellar, but as we adopted the same notation in the slides, we will stick to it to make the similarities transparent. Koehn uses the notation c(f|e) for the fractional count of the pair (e, f) in a particular sentence. To make it clear that it is the count in the specific sentence pair (e, f), he also uses the notation c(f|e; e, f). To indicate the fractional count of the word (type) pair (e, f) over the whole corpus, he uses Σ_{(e,f)} c(f|e; e, f), i.e. we add the fractional counts for all the sentences. An alternative notation for the same would have been Σ_{i=1}^{m} c(f|e; e_i, f_i), given there are m sentences in the corpus. We introduced the notation tc, for total count, for this on the slides:

tc(f|e) = Σ_{(e,f)} c(f|e; e, f)

The re-estimated translation probability can then be calculated from this:

t(f|e) = Σ_{(e,f)} c(f|e; e, f) / Σ_{f'} Σ_{(e,f)} c(f'|e; e, f)

Here f' varies over all the F-words in the lexicon.

b. With these new translation probabilities, we may return to the alignments, and for each sentence estimate the best probability distribution over the possible alignments. This time there is no simple way as there was in round 1. For each alignment, we calculate a probability on the basis of t, and normalize to make sure that the probabilities for each sentence add up to 1.

3. Next round
a. We go about exactly as in step 2a. On the basis of the alignment probabilities estimated in step 2b, we may now calculate new translation probabilities t,
b. and on the basis of the translation probabilities estimate new alignment probabilities.

And so we may repeat the two steps as long as we like.

Properties

What is nice with this algorithm is:
- We can prove that the result gets better (or stays the same) after each round. It never deteriorates.
- The result converges towards a local optimum.
- For IBM model 1 (but not in general) this local optimum is also a global optimum.

The fast way

We have described here the underlying idea of the algorithm. The description above is probably the best for understanding what is going on. There is a problem when applying it: there are so many (too many) different alignments. We therefore derived a modified algorithm where we do not calculate the probabilities of the actual alignments. Instead we calculate the translation probabilities in step 2a directly from the translation probabilities from step 1a, and the translation probabilities in step 3a directly from the translation probabilities in 2a, without actually calculating the intermediate alignment probabilities (the b steps). The counts are still renormalized the same way in each round:

t(f|e) = tc(f|e) / Σ_{f'} tc(f'|e)
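
Spelled out as code, the fast way amounts to only a few lines. The following is a minimal Python sketch (not the implementation from the lecture or from Koehn's book), run on the two example sentence pairs used in the Examples section below; the names t and tc mirror the notation above, and NULL stands for the empty English word.

from collections import defaultdict

# Toy parallel corpus; "NULL" is the empty English word an F-word may align to.
corpus = [
    (["NULL", "dog", "barked"], ["hund", "bjeffet"]),
    (["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"]),
]
f_vocab = {f for _, fs in corpus for f in fs}
e_vocab = {e for es, _ in corpus for e in es}

# Step 1a: uniform initialization, t(f|e) = 1/s with s = number of F-words.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for round_no in range(1, 4):
    # Step "a": collect the fractional counts tc(f|e) directly from the current t.
    tc = defaultdict(float)
    total = defaultdict(float)                    # sum of tc(f'|e) over all f'
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)     # sum_i t(f|e_i) in this sentence
            for e in es:
                tc[(f, e)] += t[(f, e)] / norm    # this token pair's fractional count
                total[e] += t[(f, e)] / norm
    # Re-estimate: t(f|e) = tc(f|e) / sum_f' tc(f'|e).
    t = {(f, e): tc[(f, e)] / total[e] for f in f_vocab for e in e_vocab}
    print(round_no, round(t[("hund", "dog")], 6), round(t[("bjeffet", "barked")], 6))

After the first pass the printout shows t(hund|dog) = 0.615385 and t(bjeffet|barked) = 0.5, the same values that are derived by hand in the Examples below.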

Examples

There is a very simple example in Jurafsky and Martin which illustrates the calculation with the original algorithm. You should consult this first. In the example in the lecture, we followed the modified algorithm where we sidestep the actual alignments. Let us now see how the example from the lecture would go with the full algorithm first (similarly to the Jurafsky-Martin example), before we compare it to the example from the lecture with some more details filled in. We number the sentences so that the simplest example comes first:

Sentence 1:
- e1: dog barked
- f1: hund bjeffet

Sentence 2:
- e2: dog bit dog
- f2: hund bet hund

The theoretically sound, but computationally intractable way

Step 1a - Initialization. Since there are 3 Norwegian words, all t(f|e) are set to 1/3.

t(hund|dog) = 1/3      t(bet|dog) = 1/3      t(bjeffet|dog) = 1/3
t(hund|bit) = 1/3      t(bet|bit) = 1/3      t(bjeffet|bit) = 1/3
t(hund|barked) = 1/3   t(bet|barked) = 1/3   t(bjeffet|barked) = 1/3
t(hund|NULL) = 1/3     t(bet|NULL) = 1/3     t(bjeffet|NULL) = 1/3

Step 1b - Alignments

We must also include the empty word NULL in the E-sentence (in position 0) to indicate that a word in the F-sentence may be aligned to nothing. Each of the 2 words in the sentence f1 may come from one of 3 different words in sentence e1. Hence there are 9 different alignments:

<0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>, <2,2>,

where <i,j> means that hund is aligned to the word in position i of e1 and bjeffet to the word in position j (0 = NULL, 1 = dog, 2 = barked). Since all translation probabilities are equally likely, each alignment will have the same probability. Since there are 9 different alignments, each of them will have the probability 1/9. Writing a1 for the alignment probabilities of the first sentence, we have a1(<0,0>) = a1(<0,1>) = ... = a1(<2,2>) = 1/9.

For sentence 2, there are 3 words in f2. Each of them may be aligned to any of 4 different words in e2 (including NULL). Hence there are 4*4*4 = 64 different alignments, ranging from <0,0,0> to <3,3,3>. We could take the easy way out and say that each of them is equally likely, hence a2(<0,0,0>) = a2(<0,0,1>) = ... = a2(<3,3,3>) = 1/64. But to prepare our understanding for later rounds, let us see what happens if we follow the recipe. To calculate the probability of one particular alignment, we multiply together the involved translation probabilities, e.g.

P2(<1,2,0>) = t(hund|dog)*t(bet|bit)*t(hund|NULL) = 1/27.

In this round, we get exactly the same result for all the alignments, 1/27. But that isn't the same as 1/64. Has anything gone wrong here? No. The score 1/27 is not the probability of the alignment. To get at the probability we must normalize. First we sum together the scores for all the alignments, which yields 64/27. Then, to get the probability for each alignment, we divide its score by this sum. Hence the probability for each alignment is (1/27)/(64/27), which is a complicated way to write 1/64.
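
For a small example the alignments can simply be enumerated. The sketch below is only an illustration (the helper name alignment_distribution is mine, not from the note): it lists every alignment vector for a sentence pair, scores it as the product of the involved t(f|e), and normalizes. With the uniform t it confirms the numbers above: 9 alignments of probability 1/9 for sentence 1, and 64 alignments of probability 1/64 for sentence 2.

from itertools import product

def alignment_distribution(es, fs, t):
    """All alignment vectors <a_1,...,a_m> (a_j = position in es, 0 = NULL) with
    normalized probabilities; one alignment's score is the product of t(f_j|e_{a_j})."""
    scores = {}
    for a in product(range(len(es)), repeat=len(fs)):
        score = 1.0
        for j, i in enumerate(a):
            score *= t[(fs[j], es[i])]
        scores[a] = score
    z = sum(scores.values())                      # normalization constant
    return {a: s / z for a, s in scores.items()}

t_uniform = {(f, e): 1.0 / 3
             for f in ["hund", "bjeffet", "bet"]
             for e in ["NULL", "dog", "barked", "bit"]}

a1 = alignment_distribution(["NULL", "dog", "barked"], ["hund", "bjeffet"], t_uniform)
a2 = alignment_distribution(["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"], t_uniform)
print(len(a1), a1[(0, 0)])     # 9 alignments, each 1/9
print(len(a2), a2[(1, 2, 0)])  # 64 alignments, each 1/64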

Step 2a - Maximize the translation probabilities

Then the show may start. We first calculate the fractional counts for the word pairs in the lexicon, and we do this sentence by sentence, starting with sentence 1. To take one example, what is the fractional count of (dog, hund) in sentence 1? We must see which alignments align the two words. There are 3: <1,0>, <1,1>, <1,2>. (A good piece of advice at this point is to draw the alignments while you read.) To get the fractional count we must add the probabilities of these alignments, i.e.

c(hund|dog; e1, f1) = a1(<1,0>) + a1(<1,1>) + a1(<1,2>) = 3*(1/9) = 1/3.

We can repeat for the pair (hund, barked) and get

c(hund|barked; e1, f1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 3*(1/9) = 1/3,

and so on. We see we get the same for all word pairs in this sentence:

c(hund|dog; e1, f1) = 1/3       c(bjeffet|dog; e1, f1) = 1/3
c(hund|barked; e1, f1) = 1/3    c(bjeffet|barked; e1, f1) = 1/3
c(hund|NULL; e1, f1) = 1/3      c(bjeffet|NULL; e1, f1) = 1/3

(There is a typo in the lecture slides and in the first version of these notes, writing t instead of c in the right column. The same for sentence 2.)

Sentence 2 is more exciting. Consider first the pair (bet, bit). They get aligned by all alignments of the form <x,2,y>, where x and y are any of 0, 1, 2, 3. There are 16 such alignments. (We don't bother to write them out.) Each alignment has probability 1/64. Hence

c(bet|bit; e2, f2) = 16/64 = 1/4.

Similarly we get c(bet|NULL; e2, f2) = 16/64 = 1/4. To count the pair (dog, bet), they are aligned by all alignments of the form <x,1,y> and all alignments of the form <x,3,y>, hence

c(bet|dog; e2, f2) = 2*16/64 = 1/2.

To count the pair (bit, hund), we must consider both alignments of the form <2,x,y> and of the form <x,y,2>. (Observe that <2,x,2> should be counted twice since two occurrences of hund are aligned to bit.) And to count the pair (dog, hund), we must consider all alignments <1,x,y>, <3,x,y>, <x,y,1> and <x,y,3>. We get the following counts for sentence 2:

c(hund|dog; e2, f2) = 1      c(bet|dog; e2, f2) = 1/2
c(hund|bit; e2, f2) = 1/2    c(bet|bit; e2, f2) = 1/4
c(hund|NULL; e2, f2) = 1/2   c(bet|NULL; e2, f2) = 1/4
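
The fractional counts can be read off the same enumeration: every alignment contributes its (normalized) probability to each word pair it links, and contributes twice to a pair it links twice (as with <2,x,2> above). Again a self-contained, purely illustrative sketch, not code from the note; it reproduces the sentence-2 counts 1, 1/4 and 1/2.

from collections import defaultdict
from itertools import product

def fractional_counts(es, fs, t):
    """c(f|e; e, f) for one sentence pair: every alignment contributes its
    normalized probability to each word pair it links; a pair linked twice
    in one alignment is counted twice."""
    scores = {}
    for a in product(range(len(es)), repeat=len(fs)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(fs[j], es[i])]
        scores[a] = p
    z = sum(scores.values())
    counts = defaultdict(float)
    for a, p in scores.items():
        for j, i in enumerate(a):
            counts[(fs[j], es[i])] += p / z
    return counts

t_uniform = {(f, e): 1.0 / 3
             for f in ["hund", "bjeffet", "bet"]
             for e in ["NULL", "dog", "barked", "bit"]}

c_s2 = fractional_counts(["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"], t_uniform)
print(c_s2[("hund", "dog")], c_s2[("bet", "bit")], c_s2[("hund", "NULL")])  # ~1.0 0.25 0.5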

We get the total counts (tc) by adding the fractional counts for all the sentences in the corpus, resulting in

tc(hund|dog) = 1 + 1/3 = 4/3      tc(bet|dog) = 1/2      tc(bjeffet|dog) = 1/3        t*(dog) = 4/3 + 1/2 + 1/3 = 13/6
tc(hund|bit) = 1/2                tc(bet|bit) = 1/4      tc(bjeffet|bit) = 0          t*(bit) = 3/4
tc(hund|barked) = 1/3             tc(bet|barked) = 0     tc(bjeffet|barked) = 1/3     t*(barked) = 2/3
tc(hund|NULL) = 1/2 + 1/3 = 5/6   tc(bet|NULL) = 1/4     tc(bjeffet|NULL) = 1/3       t*(NULL) = 5/6 + 1/4 + 1/3 = 17/12

In the last column we have added all the total counts for one E-word, e.g. t*(dog) = Σ_f tc(f|dog).

We can then finally calculate the new translation probabilities:

e        f         t(f|e) exact               decimal
NULL     hund      (5/6)/(17/12) = 10/17      0.588235
NULL     bet       (1/4)/(17/12) = 3/17       0.176471
NULL     bjeffet   (1/3)/(17/12) = 4/17       0.235294
dog      hund      (4/3)/(13/6)  = 8/13       0.615385
dog      bet       (1/2)/(13/6)  = 3/13       0.230769
dog      bjeffet   (1/3)/(13/6)  = 2/13       0.153846
bit      hund      (1/2)/(3/4)   = 2/3        0.666667
bit      bet       (1/4)/(3/4)   = 1/3        0.333333
barked   hund      (1/3)/(2/3)   = 1/2        0.5
barked   bjeffet   (1/3)/(2/3)   = 1/2        0.5
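
The additions and divisions above are easy to verify with exact fractions. Here is a small throwaway check (mine, not part of the note) using Python's fractions module and the per-sentence counts just derived.

from fractions import Fraction as F

# Fractional counts from sentence 1 and sentence 2, as derived above.
counts_s1 = {("hund", "dog"): F(1, 3), ("bjeffet", "dog"): F(1, 3),
             ("hund", "barked"): F(1, 3), ("bjeffet", "barked"): F(1, 3),
             ("hund", "NULL"): F(1, 3), ("bjeffet", "NULL"): F(1, 3)}
counts_s2 = {("hund", "dog"): F(1), ("bet", "dog"): F(1, 2),
             ("hund", "bit"): F(1, 2), ("bet", "bit"): F(1, 4),
             ("hund", "NULL"): F(1, 2), ("bet", "NULL"): F(1, 4)}

tc = {}                                   # total counts over the corpus
for counts in (counts_s1, counts_s2):
    for pair, value in counts.items():
        tc[pair] = tc.get(pair, F(0)) + value

t_star = {}                               # t*(e) = sum_f tc(f|e)
for (f, e), value in tc.items():
    t_star[e] = t_star.get(e, F(0)) + value

t_new = {(f, e): value / t_star[e] for (f, e), value in tc.items()}
print(t_new[("hund", "dog")], float(t_new[("hund", "dog")]))    # 8/13  0.615384...
print(t_new[("hund", "NULL")], float(t_new[("hund", "NULL")]))  # 10/17 0.588235...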

Step 2b - Estimate alignment probabilities

It is time to estimate the alignment probabilities again. Remember this is done sentence by sentence, starting with sentence 1. There are 9 different alignments to consider. For each of them we may calculate an initial unnormalized probability, call it P1, on the basis of the last translation probabilities.

P1(<0,0>) = t(hund|NULL)*t(bjeffet|NULL)     = (10/17)*(4/17) = 0.138408    a1(<0,0>) = 0.091373
P1(<0,1>) = t(hund|NULL)*t(bjeffet|dog)      = (10/17)*(2/13) = 0.090498    a1(<0,1>) = 0.059744
P1(<0,2>) = t(hund|NULL)*t(bjeffet|barked)   = (10/17)*(1/2)  = 0.294118    a1(<0,2>) = 0.194168
P1(<1,0>) = t(hund|dog)*t(bjeffet|NULL)      = (8/13)*(4/17)  = 0.144796    a1(<1,0>) = 0.095590
P1(<1,1>) = t(hund|dog)*t(bjeffet|dog)       = (8/13)*(2/13)  = 0.094675    a1(<1,1>) = 0.062501
P1(<1,2>) = t(hund|dog)*t(bjeffet|barked)    = (8/13)*(1/2)   = 0.307692    a1(<1,2>) = 0.203130
P1(<2,0>) = t(hund|barked)*t(bjeffet|NULL)   = (1/2)*(4/17)   = 0.117647    a1(<2,0>) = 0.077667
P1(<2,1>) = t(hund|barked)*t(bjeffet|dog)    = (1/2)*(2/13)   = 0.076923    a1(<2,1>) = 0.050782
P1(<2,2>) = t(hund|barked)*t(bjeffet|barked) = (1/2)*(1/2)    = 0.25        a1(<2,2>) = 0.165043
Sum of the P1's                                               = 1.514757

We sum the P1 scores (last line) and normalize them (last column, a1 = P1/1.514757) to get the probability distribution over the alignments. We may do the same for sentence 2. But because there are 64 different alignments, we refrain from carrying out the details.

Step 3a - Maximize the translation probabilities

We proceed exactly as in step 2a. We first collect the fractional counts sentence by sentence, starting with sentence 1. For example, we get

c(hund|barked; e1, f1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 0.077667 + 0.050782 + 0.165043 = 0.293492,

and similarly for the other fractional counts in sentence 1. Since we have not calculated the alignments for sentence 2, we stop here. Hopefully the idea is clear by now.

The fast lane

Manually we refrain from calculating 64 alignments, but it wouldn't have been a problem for a machine. However, the number of alignments grows exponentially with the length of the sentences, and for sentences of realistic length the machines must soon give in too. Let us repeat the calculations from the slides from the lecture. The point is that we skip the alignments and pass directly from step 1a to step 2a and then to step 3a, etc. The key is the formula

c(f|e; e, f) = t(f|e) / (Σ_{i=0}^{l} t(f|e_i)) * Σ_{j=1}^{m} δ(f, f_j) * Σ_{i=0}^{l} δ(e, e_i)

(here e_0, ..., e_l are the words of the E-sentence, with e_0 = NULL, and f_1, ..., f_m the words of the F-sentence), which lets us calculate the fractional counts directly from the (last round of) translation probabilities.
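
Both this table and the step 3a count can be reproduced by brute force from the re-estimated t values. The sketch below (illustrative only, not code from the lecture) scores the 9 alignments of sentence 1, normalizes, and then sums the three alignments that align hund to barked; the formula above, which the next section explains in detail, gives the same numbers without ever listing the alignments.

from itertools import product

# Translation probabilities after round 1 (the table in step 2a above).
t = {("hund", "NULL"): 10/17, ("bet", "NULL"): 3/17, ("bjeffet", "NULL"): 4/17,
     ("hund", "dog"): 8/13,   ("bet", "dog"): 3/13,  ("bjeffet", "dog"): 2/13,
     ("hund", "bit"): 2/3,    ("bet", "bit"): 1/3,
     ("hund", "barked"): 1/2, ("bjeffet", "barked"): 1/2}

es, fs = ["NULL", "dog", "barked"], ["hund", "bjeffet"]

# Score each of the 9 alignments as the product of the two relevant t values.
scores = {a: t[(fs[0], es[a[0]])] * t[(fs[1], es[a[1]])]
          for a in product(range(len(es)), repeat=len(fs))}
z = sum(scores.values())
print(round(z, 6))                          # sum of the P1 scores: 1.514757
for a, p in sorted(scores.items()):
    print(a, round(p, 6), round(p / z, 6))  # the P1 and a1 columns

# Step 3a: c(hund|barked) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>).
print(round(sum(p for a, p in scores.items() if a[0] == 2) / z, 6))  # ~0.2935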

Step 2a - Maximize the translation probabilities

To understand the formula, f_j refers to the word at position j in sentence f. Thus in sentence 1, if f is hund, then δ(f, f_j) = 1 for j = 1, while δ(f, f_j) = 0 for j = 2. Similarly, e_i refers to the word in position i in the English string. Hence

c(hund|barked; e1, f1)
= t(hund|barked)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(barked, e_i)
= (1/3)/(1/3 + 1/3 + 1/3) * (δ(hund, hund) + δ(hund, bjeffet)) * (δ(barked, NULL) + δ(barked, dog) + δ(barked, barked))
= (1/3) * 1 * 1 = 1/3,

and similarly for the other word pairs. We get the same fractional counts for sentence 1 as when we used explicit alignments.

Then sentence 2. To take two examples:

c(bet|bit; e2, f2) = t(bet|bit)/(Σ_i t(bet|e_i)) * Σ_j δ(bet, f_j) * Σ_i δ(bit, e_i)
= (1/3)/(1/3 + 1/3 + 1/3 + 1/3) * 1 * 1 = 1/4

c(hund|dog; e2, f2) = t(hund|dog)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(dog, e_i)
= (1/3)/(1/3 + 1/3 + 1/3 + 1/3) * 2 * 2 = 1

Hurray, we get the same fractional counts as with the explicit use of alignments. And we may proceed as we did there, calculating first the total fractional counts, tc, and then the translation probabilities, t.

Step 3a - Maximize the translation probabilities

We can harvest the reward when we come to the next round and want to calculate the fractional counts. Take an example from sentence 1:

c(hund|barked; e1, f1) = t(hund|barked)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(barked, e_i)
= (0.5 * 1 * 1)/(0.588235 + 0.615385 + 0.5) = 0.5/1.70362 = 0.293493,

which is close enough to the result we got by taking the long route (given that we use a calculator and round off in each round). The miracle is that this works equally well on sentence 2, for example:

c(hund|dog; e2, f2) = t(hund|dog)/(Σ_i t(hund|e_i)) * Σ_j δ(hund, f_j) * Σ_i δ(dog, e_i)
= (0.615385 * 2 * 2)/(0.588235 + 0.615385 + 0.666667 + 0.615385) = ?
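
Written out in code, the key formula is a one-liner. The sketch below is only an illustration (the helper name count is mine, not from the note); it reproduces the round-1 counts 1/3, 1/4 and 1, the round-2 count 0.293493 for (hund, barked), and also evaluates the sentence-2 example left open above.

def count(f, e, es, fs, t):
    """The key formula: c(f|e; e, f) =
    t(f|e) / sum_i t(f|e_i)  *  sum_j delta(f, f_j)  *  sum_i delta(e, e_i)."""
    return t[(f, e)] / sum(t[(f, ei)] for ei in es) * fs.count(f) * es.count(e)

e1, f1 = ["NULL", "dog", "barked"], ["hund", "bjeffet"]
e2, f2 = ["NULL", "dog", "bit", "dog"], ["hund", "bet", "hund"]

# Round 1: uniform t, as in step 1a.
t0 = {(f, e): 1/3 for f in ["hund", "bjeffet", "bet"]
      for e in ["NULL", "dog", "barked", "bit"]}
print(count("hund", "barked", e1, f1, t0))             # 1/3
print(count("bet", "bit", e2, f2, t0))                 # 1/4
print(count("hund", "dog", e2, f2, t0))                # 1

# Round 2: the re-estimated t values for hund, from the table in step 2a.
t1 = {("hund", "NULL"): 10/17, ("hund", "dog"): 8/13,
      ("hund", "barked"): 1/2, ("hund", "bit"): 2/3}
print(round(count("hund", "barked", e1, f1, t1), 6))   # 0.293493
print(round(count("hund", "dog", e2, f2, t1), 6))      # the sentence-2 example: 0.990291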

Summing up

This concludes the examples. Hopefully it is now possible to better see:
- the motivation behind the original approach, where we explicitly calculate alignments,
- that the faster algorithm yields the same results as the original algorithm, at least on the example we explicitly calculated, and that even though it may be hard to see step by step that the two algorithms produce the same results in general, we may open up to the idea,
- and that the fast algorithm is computationally tractable.