EM with Features. Nov. 19, 2013

Word Alignment
das Haus → the house
ein Buch → a book
das Buch → the book

Lexical Translation
Goal: a model p(e | f, m), where e and f are complete English and Foreign sentences, e = ⟨e_1, e_2, ..., e_m⟩ and f = ⟨f_1, f_2, ..., f_n⟩.
Lexical translation makes the following assumptions:
Each word e_i in e is generated from exactly one word in f.
Thus, we have an alignment a_i that indicates which word e_i came from; specifically, it came from f_{a_i}.
Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}.

IBM Model 1
The simplest possible lexical translation model. Additional assumptions:
The m alignment decisions are independent.
The alignment distribution for each a_i is uniform over all source words and NULL.

For each i ∈ {1, 2, ..., m}:
  a_i ~ Uniform(0, 1, 2, ..., n)
  e_i ~ Categorical(θ_{f_{a_i}})

p(e, a | f, m) = ∏_{i=1}^{m} 1/(1+n) · p(e_i | f_{a_i})

Equivalently, factoring per position:
p(e_i, a_i | f, m) = 1/(1+n) · p(e_i | f_{a_i})
p(e, a | f, m) = ∏_{i=1}^{m} p(e_i, a_i | f, m)
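
To make the generative story concrete, here is a minimal Python sketch (not from the lecture) that samples a target sentence and alignment from Model 1. The toy translation table t[f][e] = p(e | f), the function name, and all numbers are illustrative assumptions.

import random

# Toy translation table t[f][e] = p(e | f); the values are illustrative only.
t = {
    "NULL":  {"the": 0.1,  "house": 0.1,  "is": 0.4,  "small": 0.4},
    "das":   {"the": 0.8,  "house": 0.1,  "is": 0.05, "small": 0.05},
    "Haus":  {"the": 0.1,  "house": 0.8,  "is": 0.05, "small": 0.05},
    "ist":   {"the": 0.05, "house": 0.05, "is": 0.8,  "small": 0.1},
    "klein": {"the": 0.05, "house": 0.05, "is": 0.1,  "small": 0.8},
}

def sample_model1(f_sent, m):
    """Sample (e, a) from IBM Model 1 given a source sentence and a target length m."""
    f = ["NULL"] + f_sent                                   # position 0 is NULL
    e, a = [], []
    for _ in range(m):
        a_i = random.randint(0, len(f_sent))                # a_i ~ Uniform(0, 1, ..., n)
        words, probs = zip(*t[f[a_i]].items())
        e_i = random.choices(words, weights=probs)[0]       # e_i ~ Categorical(theta_{f_{a_i}})
        a.append(a_i)
        e.append(e_i)
    return e, a

print(sample_model1(["das", "Haus", "ist", "klein"], 4))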

Marginal probability
p(e_i, a_i | f, m) = 1/(1+n) · p(e_i | f_{a_i})
p(e_i | f, m) = ∑_{a_i=0}^{n} 1/(1+n) · p(e_i | f_{a_i})

Recall our independence assumptions: all alignment decisions are independent of each other, and given the alignments all translation decisions are independent of each other, so the marginal translation decisions are also independent of each other:
p(a, b, c, d) = p(a) p(b) p(c) p(d)

p(e | f, m) = ∏_{i=1}^{m} p(e_i | f, m)
            = ∏_{i=1}^{m} ∑_{a_i=0}^{n} 1/(1+n) · p(e_i | f_{a_i})
            = 1/(1+n)^m · ∏_{i=1}^{m} ∑_{a_i=0}^{n} p(e_i | f_{a_i})
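
A small sketch of the marginal computation, assuming a translation table t[f][e] = p(e | f) like the toy one above; the function name is an illustrative assumption.

def model1_marginal(e_sent, f_sent, t):
    """p(e | f, m) = 1/(1+n)^m * prod_i sum_{a_i=0..n} p(e_i | f_{a_i})."""
    f = ["NULL"] + f_sent
    n = len(f_sent)
    prob = 1.0
    for e_i in e_sent:
        prob *= sum(t[f_j].get(e_i, 0.0) for f_j in f) / (1 + n)
    return prob

# e.g. model1_marginal(["the", "house"], ["das", "Haus"], t)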

Example
Source sentence (with NULL in position 0): NULL(0) das(1) Haus(2) ist(3) klein(4)
Start with a foreign sentence and a target length. For each target position, first pick an alignment a_i (a source position, possibly NULL), then generate the target word e_i from the translation distribution of the aligned source word, e.g. producing "the house is small".
(The original slides step through this generative process word by word, twice, the second time with the target words generated under a different alignment order.)

EM for Model 1 computes expected word-pair counts under the posterior over alignments:
freq(Buch, book) = ?   freq(das, book) = ?   freq(ein, book) = ?

freq(Buch, book) = E_{p(a | f = das Buch, e = the book)} [ ∑_i I[e_i = book, f_{a_i} = Buch] ]
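
A minimal sketch of the full Model 1 EM loop on a toy parallel corpus (function names, initialization, and data are illustrative, not the lecture's code). Because the alignment posterior factors per target position, the expected count for a pair (f_j, e_i) is just the normalized translation probability.

from collections import defaultdict

def model1_em(corpus, iterations=10):
    """corpus: list of (f_sent, e_sent) pairs. Returns t[f][e], an estimate of p(e | f)."""
    e_vocab = {e for _, e_sent in corpus for e in e_sent}
    # Uniform initialization of p(e | f).
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))   # expected pair counts
        total = defaultdict(float)
        for f_sent, e_sent in corpus:
            f = ["NULL"] + f_sent
            for e_i in e_sent:
                z = sum(t[f_j][e_i] for f_j in f)          # normalizer for this e_i
                for f_j in f:
                    p = t[f_j][e_i] / z                    # posterior p(a_i = j | e, f)
                    count[f_j][e_i] += p                   # E[freq(f_j, e_i)]
                    total[f_j] += p
        # M step: renormalize the expected counts.
        for f_j in count:
            for e_i in count[f_j]:
                t[f_j][e_i] = count[f_j][e_i] / total[f_j]
    return t

corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]
t_hat = model1_em(corpus)
print(t_hat["Buch"]["book"])   # should be large: the model learns Buch <-> book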

Convergence

Evaluation
Since we have a probabilistic model, we can evaluate perplexity:

PPL = 2^( -(1 / ∑_{(e,f) ∈ D} |e|) · log_2 ∏_{(e,f) ∈ D} p(e | f) )

Both quantities fall as EM iterates: over Iter 1, Iter 2, Iter 3, Iter 4, ..., the -log likelihood goes 7.66, 7.21, 6.84, ..., 6 and the perplexity goes 2.42, 2.30, 2.21, ..., 2.
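
A small sketch of this computation from per-sentence probabilities; the helper name sentence_prob is an assumption, and the usage comment presumes the model1_marginal and t_hat objects from the earlier sketches.

import math

def perplexity(corpus, sentence_prob):
    """PPL = 2^(- sum_D log2 p(e|f) / sum_D |e|); sentence_prob(e, f) returns p(e | f)."""
    log2_total = sum(math.log2(sentence_prob(e, f)) for f, e in corpus)
    num_words = sum(len(e) for f, e in corpus)
    return 2 ** (-log2_total / num_words)

# e.g. perplexity(corpus, lambda e, f: model1_marginal(e, f, t_hat))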

Hidden Markov Models
GMMs, the aggregate bigram model, and Model 1 don't have conditional dependencies between random variables. Let's consider an example of a model where this is not the case:

p(x) = ∑_{y ∈ Y^{|x|}} γ(y_{|x|} → stop) ∏_{i=1}^{|x|} γ(y_{i-1} → y_i) η(y_i ↓ x_i)

where γ(q → r) are transition probabilities and η(q ↓ x) are emission probabilities.
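
To make the sum over state sequences concrete, here is a brute-force sketch that enumerates every state sequence of a tiny two-state HMM and adds up the joint probabilities. The transition and emission tables and their numbers are illustrative assumptions, and the enumeration is exponential, so this is for illustration only.

from itertools import product

# Tiny illustrative HMM: states s1, s2, with explicit start and stop transitions.
trans = {"start": {"s1": 0.6, "s2": 0.4},
         "s1": {"s1": 0.7, "s2": 0.2, "stop": 0.1},
         "s2": {"s1": 0.3, "s2": 0.6, "stop": 0.1}}
emit = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.5, "c": 0.4}}

def p_x_bruteforce(x):
    """p(x) = sum over all y of gamma(y_|x| -> stop) * prod_i gamma(y_{i-1} -> y_i) * eta(y_i emits x_i)."""
    total = 0.0
    for y in product(["s1", "s2"], repeat=len(x)):
        prob = 1.0
        prev = "start"
        for y_i, x_i in zip(y, x):
            prob *= trans[prev][y_i] * emit[y_i][x_i]
            prev = y_i
        total += prob * trans[prev]["stop"]
    return total

print(p_x_bruteforce(["a", "b", "b", "c", "b"]))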

EM for HMMs
What statistics are sufficient to determine the parameter values?
freq(q ↓ x): how often does q emit x?
freq(q → r): how often does q transition to r?
freq(q): how often do we visit q?

And of course,
freq(q) = ∑_{r ∈ Q} freq(q → r)

Example trellis for x = a b b c b (positions i = 1, ..., 5), with forward and backward quantities such as α_2(s_2) and β_3(s_2) marked.

p(y_2 = q, y_3 = r | x) ∝ p(y_2 = q, y_3 = r, x) = α_2(q) β_3(r) γ(q → r) η(r ↓ x_3)

Normalizing,
p(y_2 = q, y_3 = r | x) = α_2(q) β_3(r) γ(q → r) η(r ↓ x_3) / ∑_{q',r' ∈ Q} α_2(q') β_3(r') γ(q' → r') η(r' ↓ x_3)
                        = α_2(q) β_3(r) γ(q → r) η(r ↓ x_3) / p(x)

where p(x) = α_{|x|}(stop).

The expectation over the full structure is then
E[freq(q → r)] = ∑_{i=1}^{|x|} p(y_i = q, y_{i+1} = r | x)

The expectation over state occupancy is
E[freq(q)] = ∑_{r ∈ Q} E[freq(q → r)]

What is E[freq(q ↓ x)]?
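
A sketch of computing these expected transition counts with the forward-backward recursions, using the same toy HMM as in the brute-force sketch above; the names and indexing conventions are my own, not the lecture's.

# Same toy two-state HMM as in the brute-force sketch (illustrative numbers).
trans = {"start": {"s1": 0.6, "s2": 0.4},
         "s1": {"s1": 0.7, "s2": 0.2, "stop": 0.1},
         "s2": {"s1": 0.3, "s2": 0.6, "stop": 0.1}}
emit = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.5, "c": 0.4}}

def expected_transition_counts(x, states=("s1", "s2")):
    n = len(x)
    # Forward: alpha[i][q] = p(x_1..x_{i+1}, y_{i+1} = q)   (0-indexed positions)
    alpha = [{q: trans["start"][q] * emit[q][x[0]] for q in states}]
    for i in range(1, n):
        alpha.append({q: emit[q][x[i]] * sum(alpha[i - 1][p] * trans[p][q] for p in states)
                      for q in states})
    # Backward: beta[i][q] = p(x_{i+2}..x_n, stop | y_{i+1} = q)
    beta = [None] * n
    beta[n - 1] = {q: trans[q]["stop"] for q in states}
    for i in range(n - 2, -1, -1):
        beta[i] = {q: sum(trans[q][r] * emit[r][x[i + 1]] * beta[i + 1][r] for r in states)
                   for q in states}
    p_x = sum(alpha[n - 1][q] * trans[q]["stop"] for q in states)

    # E[freq(q -> r)] = sum_i p(y_i = q, y_{i+1} = r | x) for ordinary states q, r
    # (the q -> stop counts would be added analogously from the last position).
    counts = {(q, r): 0.0 for q in states for r in states}
    for i in range(n - 1):
        for q in states:
            for r in states:
                counts[(q, r)] += alpha[i][q] * trans[q][r] * emit[r][x[i + 1]] * beta[i + 1][r] / p_x
    return counts, p_x

counts, p_x = expected_transition_counts(["a", "b", "b", "c", "b"])
print(p_x)                   # equals the brute-force p(x) above
print(sum(counts.values()))  # total expected state-to-state transitions = len(x) - 1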

Random Restarts
Non-convex optimization only finds a local solution. Several strategies:
Random restarts
Simulated annealing

Decipherment

Grammar Induction

Inductive Bias
A model can learn nothing without inductive bias... whence inductive bias?
Model structure
Priors (next week)
Posterior regularization (Google it)
Features provide a very flexible means to bias a model

EM with Features
Let's replace the multinomials with log-linear distributions:

γ(q → r) = γ_{q,r} = exp(w^⊤ f(q, r)) / ∑_{r' ∈ Q} exp(w^⊤ f(q, r'))

How will the likelihood of this model compare to the likelihood of the previous model?
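
A minimal sketch of such a log-linear transition distribution, with an illustrative feature function; the feature names and the helper transition_prob are assumptions, not part of the slides.

import math

def features(q, r):
    """Illustrative features of a transition q -> r."""
    return {f"pair:{q}>{r}": 1.0, f"to:{r}": 1.0, "self-loop": 1.0 if q == r else 0.0}

def transition_prob(q, r, w, states):
    """gamma(q -> r) = exp(w . f(q, r)) / sum_{r'} exp(w . f(q, r'))."""
    def score(a, b):
        return sum(w.get(k, 0.0) * v for k, v in features(a, b).items())
    z = sum(math.exp(score(q, r2)) for r2 in states)
    return math.exp(score(q, r)) / z

# e.g. with all-zero weights the distribution is uniform over next states:
states = ["s1", "s2", "stop"]
print(transition_prob("s1", "s2", {}, states))   # 1/3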

Learning Algorithm 1: E step
Given the model parameters, compute the posterior distribution q(y) over transitions (states, etc.), and from it compute

E_{q(y)}[ ∑_{q,r} freq(q → r) f(q, r) ]

These are your empirical expectations.

Learning Algorithm 1: M step
The gradient of the expected log likelihood of (x, y) under q(y) is

∇_w E_{q(y)}[log p(x, y)] = ∑_{q,r} E_q[freq(q → r)] f(q, r) - ∑_{q,r} E_q[freq(q)] p(r | q; w) f(q, r)

Use L-BFGS or gradient descent to solve.
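
A sketch of this gradient computation, assuming expected counts E_q[freq(q → r)] and E_q[freq(q)] from the E step (e.g. from the forward-backward sketch above) and the features / transition_prob helpers from the previous sketch; all names are illustrative.

from collections import defaultdict

def mstep_gradient(expected_edge, expected_state, w, states):
    """grad_w E_q[log p(x, y)] (transition part):
       sum_{q,r} E_q[freq(q->r)] f(q,r) - sum_{q,r} E_q[freq(q)] p(r|q; w) f(q,r)."""
    grad = defaultdict(float)
    for (q, r), c in expected_edge.items():
        for k, v in features(q, r).items():
            grad[k] += c * v                       # expected "observed" feature counts
    for q, c in expected_state.items():
        for r in states:
            p = transition_prob(q, r, w, states)
            for k, v in features(q, r).items():
                grad[k] -= c * p * v               # model-expected feature counts
    return grad

# One illustrative gradient-ascent step (L-BFGS would be used in practice):
# for k, g in mstep_gradient(edge_counts, state_counts, w, states).items():
#     w[k] = w.get(k, 0.0) + 0.1 * g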