Word Alignment III: Fertility Models & CRFs. February 3, 2015


Last Time...

$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment}, \text{Translation}) = \sum_{\text{Alignment}} p(\text{Alignment})\, p(\text{Translation} \mid \text{Alignment})$$

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

Fertility Models

The models we have considered so far have been efficient, but this efficiency has come at a modeling cost: what is to stop the model from translating a word 0, 1, 2, or 100 times? We introduce fertility models to deal with this.

IBM Model 3

Fertility

Fertility: the number of English words generated by a foreign word. Modeled by a categorical distribution n(φ | f).

Examples:

φ    Unabhaengigkeitserklaerung    zum = (zu + dem)    Haus
0    0.00008                       0.01                0.01
1    0.1                           0                   0.92
2    0.0002                        0.9                 0.07
3    0.8                           0.0009              0
4    0.009                         0.0001              0
5    0                             0                   0
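
To make the categorical fertility distribution concrete, here is a minimal sketch (hypothetical Python; the values are taken from the table above) that stores n(φ | f) as a per-word dictionary and looks up a probability.

```python
# A minimal sketch: fertility distributions n(phi | f) as per-source-word
# categorical distributions (values copied from the table above).
fertility = {
    "Unabhaengigkeitserklaerung": {0: 0.00008, 1: 0.1,  2: 0.0002, 3: 0.8,    4: 0.009,  5: 0.0},
    "zum":                        {0: 0.01,    1: 0.0,  2: 0.9,    3: 0.0009, 4: 0.0001, 5: 0.0},
    "Haus":                       {0: 0.01,    1: 0.92, 2: 0.07,   3: 0.0,    4: 0.0,    5: 0.0},
}

def n(phi, f):
    """Probability that source word f generates phi English words."""
    return fertility[f].get(phi, 0.0)

print(n(3, "Unabhaengigkeitserklaerung"))  # 0.8: usually spans ~3 English words
```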

Fertility

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

Fertility models mean that we can no longer exploit conditional independencies to write p(a | f, m) as a series of local alignment decisions. How do we compute the statistics required for EM training?
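
To see what is lost, here is a minimal sketch (hypothetical Python with toy probabilities) comparing the brute-force sum over all alignment vectors with the per-position factorization that Models 1 and 2 allow. Once a fertility term couples the alignment decisions, the second, cheap computation is no longer valid.

```python
# A minimal sketch (toy numbers, hypothetical helpers): without fertility,
# the sum over all alignment vectors factorizes per target position.
from itertools import product

f = ["NULL", "das", "Haus"]          # source words, index 0 = NULL
e = ["the", "house"]                 # target words
n, m = len(f) - 1, len(e)

def p_align(j, i):
    # toy Model-2-style alignment probability p(a_i = j | i); here uniform
    return 1.0 / (n + 1)

def p_trans(ei, fj):
    # toy lexical translation table p(e_i | f_j)
    table = {("the", "das"): 0.7, ("house", "Haus"): 0.8,
             ("the", "NULL"): 0.2, ("house", "NULL"): 0.05}
    return table.get((ei, fj), 0.01)

# Brute force: enumerate all (n+1)^m alignment vectors.
brute = 0.0
for a in product(range(n + 1), repeat=m):
    term = 1.0
    for i, j in enumerate(a):
        term *= p_align(j, i) * p_trans(e[i], f[j])
    brute += term

# Factorized: product over positions of a local sum. This is only valid
# because p(a | f, m) decomposes into independent per-position decisions.
factored = 1.0
for i in range(m):
    factored *= sum(p_align(j, i) * p_trans(e[i], f[j]) for j in range(n + 1))

print(brute, factored)  # identical; a fertility term would break this equality
```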

EM Recipe reminder

If alignment points were visible, training fertility models would be easy: we would count and normalize.

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{\text{count}(3, \text{Unabhaengigkeitserklaerung})}{\text{count}(\text{Unabhaengigkeitserklaerung})}$$

But alignments are not visible, so we use expected counts:

$$n(\phi = 3 \mid f = \text{Unabhaengigkeitserklaerung}) = \frac{E[\text{count}(3, \text{Unabhaengigkeitserklaerung})]}{E[\text{count}(\text{Unabhaengigkeitserklaerung})]}$$
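
A minimal sketch (hypothetical Python) of the count-and-normalize step. With visible alignments every observation carries weight 1; under EM the same code runs on fractional weights supplied by a posterior over alignments.

```python
# A minimal sketch: estimating n(phi | f) by (expected) count and normalize.
from collections import defaultdict

def estimate_fertility(weighted_observations):
    """weighted_observations: iterable of (f_word, phi, weight) triples.
    With visible alignments every weight is 1.0 (plain counting); under EM
    the weight is the posterior probability of the alignment configuration
    that gave f_word fertility phi."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for f_word, phi, weight in weighted_observations:
        counts[f_word][phi] += weight
        totals[f_word] += weight
    return {f_word: {phi: c / totals[f_word] for phi, c in phis.items()}
            for f_word, phis in counts.items()}

# Visible alignments: unit counts.
obs = [("Unabhaengigkeitserklaerung", 3, 1.0),
       ("Unabhaengigkeitserklaerung", 3, 1.0),
       ("Unabhaengigkeitserklaerung", 2, 1.0)]
print(estimate_fertility(obs))  # {'Unabhaengigkeitserklaerung': {3: 0.66..., 2: 0.33...}}
```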

Expectation & Fertility We need to compute expected counts under p(a f,e,m) Unfortunately p(a f,e,m) doesn t factorize nicely. :( Can we sum exhaustively? How many different a s are there? What to do?

Sample Alignments

Monte-Carlo methods:
- Gibbs sampling
- Importance sampling
- Particle filtering

For historical reasons:
- Use the Model 2 alignment to start (easy!)
- Take a weighted sum over all alignment configurations that are close to this alignment configuration
- Is this correct? No! Does it work? Sort of.
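
One way to picture the sampling approach: draw alignment vectors from (an approximation to) p(a | e, f, m) and average the counts of interest. A minimal sketch, with a hypothetical sample_alignment function standing in for whatever sampler (Gibbs, importance sampling, ...) is actually used; the uniform stand-in at the end is for illustration only, not the true posterior.

```python
# A minimal sketch: Monte-Carlo estimate of expected fertility counts.
from collections import Counter
import random

def expected_fertility_counts(e, f, sample_alignment, num_samples=1000):
    """sample_alignment(e, f) is hypothetical: it should draw a ~ p(a | e, f, m)
    and return a list a[0..m-1] of source indices (0 = NULL)."""
    totals = Counter()
    for _ in range(num_samples):
        a = sample_alignment(e, f)
        fert = Counter(a)                      # fertility of each source position
        for j, f_word in enumerate(f):
            totals[(f_word, fert.get(j, 0))] += 1.0
    # average over samples -> Monte-Carlo estimate of E[count(phi, f_word)]
    return {key: c / num_samples for key, c in totals.items()}

# Toy stand-in sampler (uniform over source positions; NOT the true posterior).
e = ["the", "house"]; f = ["NULL", "das", "Haus"]
uniform = lambda e, f: [random.randrange(len(f)) for _ in e]
print(expected_fertility_counts(e, f, uniform, num_samples=100))
```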

Pitfalls of Conditional Models

[Figure: IBM Model 4 alignment vs. our model's alignment]

Lexical Translation

IBM Models 1-5 [Brown et al., 1993]:
- Model 1: lexical translation, uniform alignment
- Model 2: absolute position model
- Model 3: fertility
- Model 4: relative position model (jumps in target string)
- Model 5: non-deficient model
- Widely used: the Giza++ toolkit

HMM translation model [Vogel et al., 1996]:
- Relative position model (jumps in source string)

These days the latent variables (alignments) are more useful than the translations.

A few tricks...

[Figures: alignments under p(f | e) and p(e | f)]

Alignment Tool: fast_align

Another View

With this model:

$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$

the problem of word alignment is:

$$\hat{a} = \arg\max_{a \in [0,n]^m} p(a \mid e, f, m)$$

Can we model this distribution directly?

Markov Random Fields (MRFs)

Directed factorization (Bayesian network) over A, B, C, X, Y, Z:

$$p(A, B, C, X, Y, Z) = p(A)\, p(B \mid A)\, p(C \mid B)\, p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$

Undirected factorization into factors ψ:

$$p(A, B, C, X, Y, Z) = \frac{1}{Z}\, \psi_1(A, B)\, \psi_2(B, C)\, \psi_3(C, D)\, \psi_4(X)\, \psi_5(Y)\, \psi_6(Z)$$

Computing Z

Two variables X, Y with values in $\mathcal{X} = \{a, b, c\}$:

$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_2(x)\, \psi_3(y)$$

When the graph has certain structures (e.g., chains), you can factor the sum to get polytime DP algorithms:

$$Z = \sum_{x \in \mathcal{X}} \psi_2(x) \sum_{y \in \mathcal{X}} \psi_1(x, y)\, \psi_3(y)$$
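
A minimal sketch (hypothetical Python, toy potentials) showing that pushing the inner sum inside gives the same Z with less work; on longer chains this is exactly the trick that dynamic programming exploits.

```python
# A minimal sketch: the naive double sum vs. the factored sum for Z.
import math

domain = ["a", "b", "c"]                             # the value set {a, b, c}
psi1 = lambda x, y: 2.0 if x == y else 0.5           # pairwise potential psi_1(x, y)
psi2 = lambda x: {"a": 1.0, "b": 3.0, "c": 0.2}[x]   # unary psi_2(x)
psi3 = lambda y: {"a": 0.7, "b": 1.0, "c": 2.0}[y]   # unary psi_3(y)

# Naive: enumerate all |X|^2 joint configurations.
Z_naive = sum(psi1(x, y) * psi2(x) * psi3(y) for x in domain for y in domain)

# Factored: push the sum over y inside; each x reuses one local sum.
Z_factored = sum(psi2(x) * sum(psi1(x, y) * psi3(y) for y in domain) for x in domain)

assert math.isclose(Z_naive, Z_factored)
print(Z_naive)
```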

Log-linear models

$$p(A, B, C, X, Y, Z) = \frac{1}{Z}\, \psi_1(A, B)\, \psi_2(B, C)\, \psi_3(C, D)\, \psi_4(X)\, \psi_5(Y)\, \psi_6(Z)$$

Parameterize the factors log-linearly:

$$\psi_{1,2,3}(x, y) = \exp \sum_k w_k f_k(x, y)$$

where the w_k are weights (learned) and the f_k are feature functions (specified).
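
A minimal sketch (hypothetical Python) of one log-linear factor: its value is the exponentiated dot product of learned weights with hand-specified feature functions.

```python
# A minimal sketch: a log-linear factor psi(x, y) = exp(sum_k w_k * f_k(x, y)).
import math

# Hand-specified (illustrative) feature functions over a pair of values.
features = [
    lambda x, y: 1.0 if x == y else 0.0,                 # "values agree"
    lambda x, y: 1.0 if (x, y) == ("a", "b") else 0.0,   # one specific pair
]
weights = [1.5, -0.3]  # learned elsewhere (e.g., by maximizing likelihood)

def psi(x, y):
    return math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))

print(psi("a", "a"), psi("a", "b"))
```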

Random Fields

Benefits:
- Potential functions can be defined with respect to arbitrary features (functions) of the variables
- Great way to incorporate knowledge

Drawbacks:
- Likelihood involves computing Z
- Maximizing likelihood usually requires computing Z (often over and over again!)

Conditional Random Fields

Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the input.

$$p(y \mid x) = \frac{1}{Z_w(x)} \exp \sum_{F \in G_y} \sum_k w_k f_k(F, x, y)$$

where $G_y$ ranges over all factors in the graph of y.
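
To make the normalizer concrete, here is a minimal sketch (hypothetical Python) of a linear-chain CRF: log-linear potentials over adjacent labels and the input, with Z_w(x) computed by the chain factorization from the "Computing Z" slide. Feature templates and weights are illustrative.

```python
# A minimal sketch: linear-chain CRF p(y | x) with Z_w(x) by dynamic programming.
import math
from itertools import product

LABELS = ["A", "B"]

def features(prev_y, y, x, i):
    """Hand-specified feature functions; they may look at the whole input x."""
    return {f"emit:{x[i]}->{y}": 1.0, f"trans:{prev_y}->{y}": 1.0}

def local_score(weights, prev_y, y, x, i):
    return sum(weights.get(name, 0.0) * val
               for name, val in features(prev_y, y, x, i).items())

def log_Z(weights, x):
    """Forward algorithm: sums over all label sequences in O(len(x) * |LABELS|^2)."""
    alpha = {y: local_score(weights, "<s>", y, x, 0) for y in LABELS}
    for i in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[py] + local_score(weights, py, y, x, i))
                                 for py in LABELS))
                 for y in LABELS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def log_prob(weights, x, y_seq):
    prev, score = "<s>", 0.0
    for i, y in enumerate(y_seq):
        score += local_score(weights, prev, y, x, i)
        prev = y
    return score - log_Z(weights, x)

# Sanity check: probabilities of all label sequences sum to 1.
w = {"emit:the->A": 1.2, "trans:A->B": 0.5}      # toy weights
x = ["the", "house"]
total = sum(math.exp(log_prob(w, x, list(ys))) for ys in product(LABELS, repeat=len(x)))
print(total)  # ~1.0
```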

Parameter Learning

CRFs are trained to maximize conditional likelihood:

$$\hat{w}_{\text{MLE}} = \arg\max_{w} \prod_{(x_i, y_i) \in \mathcal{D}} p(y_i \mid x_i; w)$$

Recall we want to directly model p(a | e, f). The likelihood of what alignments? Gold reference alignments!
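
The gradient of the conditional log-likelihood has the familiar form "observed feature counts minus model-expected feature counts". A minimal sketch (hypothetical Python) computes it by brute-force enumeration of candidate outputs; real CRF trainers compute the same expectations with dynamic programming.

```python
# A minimal sketch: gradient of the conditional log-likelihood for one pair.
import math
from collections import Counter

def loglik_gradient(weights, feats, candidates, x, gold_y):
    """feats(x, y) returns a dict of feature counts; candidates lists every y.
    Gradient = observed feature counts - expected feature counts under p(y | x)."""
    scores = {y: sum(weights.get(k, 0.0) * v for k, v in feats(x, y).items())
              for y in candidates}
    log_Z = math.log(sum(math.exp(s) for s in scores.values()))
    grad = Counter(feats(x, gold_y))
    for y in candidates:
        p_y = math.exp(scores[y] - log_Z)
        for k, v in feats(x, y).items():
            grad[k] -= p_y * v
    return grad

# Toy usage: two candidate outputs for one input.
feats = lambda x, y: {f"{x}:{y}": 1.0}
print(loglik_gradient({"x1:good": 0.5}, feats, ["good", "bad"], "x1", "good"))
```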

CRF for Alignment

One of many possibilities, due to Blunsom & Cohn (2006):

$$p(a \mid e, f) = \frac{1}{Z_w(e, f)} \exp \sum_{i=1}^{|e|} \sum_k w_k f_k(a_i, a_{i-1}, i, e, f)$$

- a has the same form as in the lexical translation models (still makes a one-to-many assumption)
- the w_k are the model parameters
- the f_k are the feature functions

Dynamic programming over the chain costs O(n^2 m), roughly O(n^3).
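
A minimal sketch (hypothetical Python) of what one such feature function might look like: it can condition on the current and previous alignment points, the target position, and the entire sentence pair. Feature names and templates are illustrative, not the paper's exact set.

```python
# A minimal sketch: feature functions f_k(a_i, a_prev, i, e, f) for an alignment CRF.
def alignment_features(a_i, a_prev, i, e, f):
    feats = {}
    if a_i > 0:  # a_i = 0 is the NULL source position
        feats[f"word_pair:{f[a_i]}|{e[i]}"] = 1.0            # lexical indicator
        feats["identical_word"] = 1.0 if f[a_i].lower() == e[i].lower() else 0.0
        feats["matching_prefix4"] = 1.0 if f[a_i][:4] == e[i][:4] else 0.0
    if a_prev is not None and a_i > 0 and a_prev > 0:
        feats[f"jump:{a_i - a_prev}"] = 1.0                   # first-order "distortion"
    return feats

# Example: English word i=1 aligned to source position 2, previous word to position 1.
e = ["pervez", "musharraf", "s", "long", "goodbye"]
f = ["NULL", "pervez", "musharrafs", "langer", "abschied"]
print(alignment_features(2, 1, 1, e, f))
```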

Model

- Labels (one per target word) index the source sentence
- Train models for (e, f) and (f, e) [inverting the reference alignments]

Experiments

[Figures: the example sentence pair "pervez musharrafs langer abschied" / "pervez musharraf s long goodbye", with alignment features highlighted one at a time: identical word, matching prefix, matching suffix, orthographic similarity, in dictionary, ...]

Lexical Features

- Word ↔ word indicator features
- Various word ↔ word co-occurrence scores
- IBM Model 1 probabilities (t | s, s | t)
- Geometric mean of Model 1 probabilities
- Dice's coefficient [binned]
- Products of the above

Lexical Features

- Word class ↔ word class indicator: NN translates as NN (NN_NN=1); NN does not translate as MD (NN_MD=1)
- Identical word feature: 2010 = 2010 (IdentWord=1, IdentNum=1)
- Identical prefix feature: Obama ~ Obamu (IdentPrefix=1)
- Orthographic similarity measure [binned]: Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
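
A minimal sketch (hypothetical Python) of how a few of these lexical features could be extracted for a candidate word pair; the feature names, bin boundaries, and similarity measure are illustrative, not the exact ones used in the experiments.

```python
# A minimal sketch: extracting a few lexical features for a word pair (f, e).
from difflib import SequenceMatcher

def lexical_features(f_word, e_word):
    feats = {}
    if f_word == e_word:
        feats["IdentWord"] = 1.0
        if f_word.isdigit():
            feats["IdentNum"] = 1.0
    if len(f_word) >= 4 and len(e_word) >= 4 and f_word[:4] == e_word[:4]:
        feats["IdentPrefix"] = 1.0
    if len(f_word) >= 4 and len(e_word) >= 4 and f_word[-4:] == e_word[-4:]:
        feats["IdentSuffix"] = 1.0
    # Binned orthographic similarity (here: a character-level ratio).
    sim = SequenceMatcher(None, f_word.lower(), e_word.lower()).ratio()
    if 0.5 <= sim < 0.8:
        feats["OrthoSim050_080"] = 1.0
    elif sim >= 0.8:
        feats["OrthoSim080_100"] = 1.0
    return feats

print(lexical_features("Al-Kaida", "Al-Qaeda"))
print(lexical_features("2010", "2010"))
```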

Other Features

- Compute features from large amounts of unlabeled text
- Does the Model 4 alignment contain this alignment point?
- What is the Model 1 posterior probability of this alignment point?

CRF Results

System                 AER    P      R
French English         9%     97%    86%
French English         9%     98%    83%
French English         7%     96%    90%
French English + M4    7%     98%    88%
French English + M4    7%     98%    87%
French English + M4    5%     98%    91%
IBM Model 4            9%     87%    95%

Summary

Unfortunately, you need gold alignments!

CRF Autoencoders

[Figure: input X_{i-1}, X_i, X_{i+1} is encoded into latent Y_{i-1}, Y_i, Y_{i+1}, which is then used to reconstruct X'_{i-1}, X'_i, X'_{i+1}]

$$\arg\max \sum_{x} \log \sum_{y \in \mathcal{Y}(x)} \underbrace{p(y \mid x)}_{\text{CRF encoder}} \; \underbrace{p(x' \mid y)}_{\text{reconstruction model}}$$

Ammar, Dyer, Smith (2014). Conditional Random Field Autoencoders.

CRF Autoencoders

General objective:

$$\arg\max \sum_{x} \log \sum_{y \in \mathcal{Y}(x)} p(y \mid x)\, p(x' \mid y)$$

For alignment:

$$\arg\max \sum_{(e,f)} \log \sum_{a \in \mathcal{A}(e,f)} \underbrace{p(a \mid e, f)}_{\text{CRF aligner}} \; \prod_{i=1}^{m} \underbrace{p(e_i \mid f_{a_i})}_{\text{lexical translation probabilities}}$$

Ammar, Dyer, Smith (2014). Conditional Random Field Autoencoders.
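
A minimal sketch (hypothetical Python) of this objective for one sentence pair, evaluated by brute-force enumeration over alignments (toy sizes only); crf_prob and lex_prob are hypothetical stand-ins for the CRF aligner and the lexical translation probabilities.

```python
# A minimal sketch: the CRF-autoencoder objective for one sentence pair,
# log sum_a p(a | e, f) * prod_i p(e_i | f_{a_i}), by brute-force enumeration.
import math
from itertools import product

def autoencoder_loglik(e, f, crf_prob, lex_prob):
    n, m = len(f) - 1, len(e)        # f[0] is NULL
    total = 0.0
    for a in product(range(n + 1), repeat=m):
        recon = 1.0
        for i, j in enumerate(a):
            recon *= lex_prob(e[i], f[j])
        total += crf_prob(a, e, f) * recon
    return math.log(total)

# Toy stand-ins (illustrative only).
e = ["the", "house"]; f = ["NULL", "das", "Haus"]
crf_prob = lambda a, e, f: (1.0 / len(f)) ** len(e)   # uniform "CRF encoder"
lex_prob = lambda ei, fj: 0.5 if (ei, fj) in {("the", "das"), ("house", "Haus")} else 0.1
print(autoencoder_loglik(e, f, crf_prob, lex_prob))
```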

CRF Autoencoder Results (Ammar, Dyer, Smith, 2014)

System           AER    P      R
Czech English    28%    71%    73%
Czech English    21%    80%    77%
Czech English    19%    81%    81%
IBM Model 4      22%    75%    80%

Summary

This is it for word alignment. Questions?
Next time: evaluation.
Keep working on HW1.