EM with Features (Nov. 19, 2013)


1 EM with Features Nov. 19, 2013

2 Word Alignment: das Haus → the house; ein Buch → a book; das Buch → the book

3-6 Lexical Translation

Goal: a model $p(e \mid f, m)$ where $e$ and $f$ are complete English and Foreign sentences, with $e = \langle e_1, e_2, \ldots, e_m \rangle$ and $f = \langle f_1, f_2, \ldots, f_n \rangle$.

Lexical translation makes the following assumptions: Each word $e_i$ in $e$ is generated from exactly one word in $f$. Thus, we have an alignment $a_i$ that indicates which word $e_i$ came from; specifically, it came from $f_{a_i}$. Given the alignments $a$, translation decisions are conditionally independent of each other and depend only on the aligned source word $f_{a_i}$.

7 IBM Model 1

Simplest possible lexical translation model. Additional assumptions: the $m$ alignment decisions are independent, and the alignment distribution for each $a_i$ is uniform over all source words and NULL.

For each $i \in \{1, 2, \ldots, m\}$:
$a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
$e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

8-12 IBM Model 1

For each $i \in \{1, 2, \ldots, m\}$:
$a_i \sim \mathrm{Uniform}(0, 1, 2, \ldots, n)$
$e_i \sim \mathrm{Categorical}(\theta_{f_{a_i}})$

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e, a \mid f, m) = \prod_{i=1}^{m} p(e_i, a_i \mid f, m) = \prod_{i=1}^{m} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$
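
To make this concrete, here is a minimal Python sketch (not from the slides) that scores a sentence pair and alignment under Model 1; the translation table t is a hypothetical stand-in for the Categorical parameters $p(e_i \mid f_{a_i})$.

# Minimal sketch of the IBM Model 1 joint probability p(e, a | f, m).
# `t` maps (f_word, e_word) -> probability; it is an illustrative placeholder.

def model1_joint(e, f, a, t):
    """p(e, a | f, m) = prod_i 1/(1+n) * t[f[a_i], e_i], with f[0] = NULL."""
    f = ["NULL"] + f                  # prepend the NULL source word
    n = len(f) - 1                    # number of real source words
    p = 1.0
    for e_i, a_i in zip(e, a):
        p *= (1.0 / (1 + n)) * t.get((f[a_i], e_i), 0.0)
    return p

# Example with toy probabilities (assumed for illustration only):
t = {("das", "the"): 0.7, ("Haus", "house"): 0.8, ("NULL", "the"): 0.1}
print(model1_joint(["the", "house"], ["das", "Haus"], [1, 2], t))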

13-19 Marginal probability

$p(e_i, a_i \mid f, m) = \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

$p(e_i \mid f, m) = \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i})$

Recall our independence assumptions: all alignment decisions are independent of each other, and given the alignments all translation decisions are independent of each other, so the joint probability factors across positions, just as $p(a, b, c, d) = p(a)\,p(b)\,p(c)\,p(d)$.

$p(e \mid f, m) = \prod_{i=1}^{m} p(e_i \mid f, m) = \prod_{i=1}^{m} \sum_{a_i=0}^{n} \frac{1}{1+n}\, p(e_i \mid f_{a_i}) = \frac{1}{(1+n)^m} \prod_{i=1}^{m} \sum_{a_i=0}^{n} p(e_i \mid f_{a_i})$
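
Because the marginal factors position by position, it can be computed without enumerating the $(1+n)^m$ alignments. A small sketch under the same assumptions as the previous one (hypothetical table t):

# Sketch of the Model 1 marginal p(e | f, m), using the factorization
# p(e | f, m) = 1/(1+n)^m * prod_i sum_{a_i=0..n} t[f[a_i], e_i].

def model1_marginal(e, f, t):
    f = ["NULL"] + f
    n = len(f) - 1
    p = 1.0
    for e_i in e:
        p *= sum(t.get((f_j, e_i), 0.0) for f_j in f) / (1 + n)
    return p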

20-29 Example

0 NULL das Haus ist klein

Start with a foreign sentence (with NULL prepended) and a target length. The target sentence is then generated one word at a time, each from its aligned source word: the; the house; the house is; the house is small.

30-38 Example

0 NULL das Haus ist klein

A second derivation of the same generative process, with the target words produced in a different order: the; house the; house is the; house is small the.

39 freq(Buch, book) = ?   freq(das, book) = ?   freq(ein, book) = ?

$\mathrm{freq}(\mathrm{Buch}, \mathrm{book}) = \mathbb{E}_{p_{w^{(1)}}(a \mid f = \text{das Buch},\, e = \text{the book})}\left[\sum_i \mathbb{I}[e_i = \mathrm{book},\, f_{a_i} = \mathrm{Buch}]\right]$
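
One way to compute these expected counts: under Model 1 the posterior over each $a_i$ factors, so the expected count of a word pair is a sum of per-position posterior probabilities. A hedged sketch of one full EM iteration (the corpus format and the dictionary table t are illustrative assumptions, not the slides' implementation; t must assign nonzero probability to every co-occurring pair, e.g. by uniform initialization):

from collections import defaultdict

def model1_em_iteration(corpus, t):
    """One EM iteration for Model 1. corpus: list of (f_words, e_words) pairs.
    t: dict (f_word, e_word) -> prob. Returns a re-estimated translation table."""
    count = defaultdict(float)    # expected freq(f_word, e_word)
    total = defaultdict(float)    # expected freq(f_word)
    for f, e in corpus:
        f = ["NULL"] + f
        for e_i in e:
            # posterior p(a_i = j | e_i, f) is proportional to t[f_j, e_i]
            z = sum(t[(f_j, e_i)] for f_j in f)
            for f_j in f:
                p = t[(f_j, e_i)] / z
                count[(f_j, e_i)] += p
                total[f_j] += p
    # M step: renormalize expected counts into a new translation table
    return {(f_j, e_i): c / total[f_j] for (f_j, e_i), c in count.items()}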

40 Convergence

41 Evaluation

Since we have a probabilistic model, we can evaluate perplexity:

$\mathrm{PPL} = 2^{-\frac{1}{\sum_{(e,f) \in D} |e|} \log_2 \prod_{(e,f) \in D} p(e \mid f)}$

[Table: negative log likelihood and perplexity at each EM iteration (Iter 1, Iter 2, Iter 3, Iter 4, ...).]
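
A small helper that matches one reading of the formula above (per-target-word perplexity over the corpus D); the argument names are illustrative:

import math

def perplexity(corpus_probs, corpus_lengths):
    """corpus_probs: p(e|f) for each pair in D; corpus_lengths: |e| for each pair.
    PPL = 2 ** ( -(1 / sum |e|) * log2 prod p(e|f) )."""
    log2_likelihood = sum(math.log2(p) for p in corpus_probs)
    return 2 ** (-log2_likelihood / sum(corpus_lengths))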

42 Hidden Markov Models

GMMs, the aggregate bigram model, and Model 1 don't have conditional dependencies between random variables. Let's consider an example of a model where this is not the case:

$p(x) = \sum_{y \in \mathcal{Y}_x} \gamma(y_{|x|} \to \mathrm{stop}) \prod_{i=1}^{|x|} \gamma(y_{i-1} \to y_i)\, \gamma(y_i \downarrow x_i)$
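
A sketch of the joint probability of one labelled sequence under this HMM, i.e. the term inside the sum above; trans and emit are hypothetical dictionaries standing in for the transition and emission parameters, with distinguished start and stop states:

def hmm_joint(x, y, trans, emit):
    """p(x, y) = prod_i trans[(y_{i-1}, y_i)] * emit[(y_i, x_i)] * trans[(y_last, 'stop')],
    where the chain begins in a distinguished 'start' state."""
    p = 1.0
    prev = "start"
    for x_i, y_i in zip(x, y):
        p *= trans[(prev, y_i)] * emit[(y_i, x_i)]
        prev = y_i
    return p * trans[(prev, "stop")]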

43-44 EM for HMMs

What statistics are sufficient to determine the parameter values?
freq(q ↓ x): How often does q emit x?
freq(q → r): How often does q transition to r?
freq(q): How often do we visit q?

And of course... $\mathrm{freq}(q) = \sum_{r \in Q} \mathrm{freq}(q \to r)$
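
Given these expected sufficient statistics, the multinomial M step is just normalization; a sketch assuming the expectations are stored in dictionaries keyed by (state, next-state) and (state, symbol):

def m_step(exp_trans, exp_emit):
    """exp_trans[(q, r)] = E[freq(q -> r)], exp_emit[(q, x)] = E[freq(q emits x)].
    Returns re-estimated transition and emission tables. Both are normalized by
    E[freq(q)] = sum_r E[freq(q -> r)] (transitions to 'stop' included)."""
    state_total = {}
    for (q, r), c in exp_trans.items():
        state_total[q] = state_total.get(q, 0.0) + c      # E[freq(q)]
    trans = {(q, r): c / state_total[q] for (q, r), c in exp_trans.items()}
    emit = {(q, x): c / state_total[q] for (q, x), c in exp_emit.items()}
    return trans, emit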

45-48 [Trellis figure: observation sequence x = a b b c b at positions i = 1, ..., 5, with forward and backward scores $\alpha_2(s_2)$ and $\beta_3(s_2)$ marked.]

$p(y_2 = q, y_3 = r \mid x) \propto p(y_2 = q, y_3 = r, x) = \alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)$

$p(y_2 = q, y_3 = r \mid x) = \frac{\alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)}{\sum_{q', r' \in Q} \alpha_2(q')\, \gamma(q' \to r')\, \gamma(r' \downarrow x_3)\, \beta_3(r')} = \frac{\alpha_2(q)\, \gamma(q \to r)\, \gamma(r \downarrow x_3)\, \beta_3(r)}{p(x)}$

where $p(x) = \alpha_{|x|+1}(\mathrm{stop})$, the total probability of the observation sequence.

49-51 (with the posterior $p(y_i = q, y_{i+1} = r \mid x)$ computed as above)

The expectation over the full structure is then

$E[\mathrm{freq}(q \to r)] = \sum_{i=1}^{|x|} p(y_i = q, y_{i+1} = r \mid x)$

The expectation over state occupancy is

$E[\mathrm{freq}(q)] = \sum_{r \in Q} E[\mathrm{freq}(q \to r)]$

What is $E[\mathrm{freq}(q \downarrow x)]$?
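
A sketch of the forward-backward computation that produces these expectations for one sequence; the state set and the trans/emit dictionaries are assumed as in the earlier sketch:

def expected_transition_counts(x, states, trans, emit):
    """Forward-backward for one sequence x. Returns (E[freq(q -> r)], p(x)).
    trans[(q, r)] and emit[(q, symbol)] are the HMM parameters, with special
    'start' and 'stop' states."""
    T = len(x)
    # forward: alpha[i][q] = p(x_1..x_{i+1}, y_{i+1} = q)  (0-based list index)
    alpha = [{q: trans[("start", q)] * emit[(q, x[0])] for q in states}]
    for i in range(1, T):
        alpha.append({r: sum(alpha[i - 1][q] * trans[(q, r)] for q in states)
                         * emit[(r, x[i])] for r in states})
    # backward: beta[i][q] = p(x_{i+2}..x_T, stop | y_{i+1} = q)
    beta = [None] * T
    beta[T - 1] = {q: trans[(q, "stop")] for q in states}
    for i in range(T - 2, -1, -1):
        beta[i] = {q: sum(trans[(q, r)] * emit[(r, x[i + 1])] * beta[i + 1][r]
                          for r in states) for q in states}
    px = sum(alpha[T - 1][q] * trans[(q, "stop")] for q in states)
    # E[freq(q -> r)] = sum_i p(y_i = q, y_{i+1} = r | x)
    exp_counts = {}
    for i in range(T - 1):
        for q in states:
            for r in states:
                p = alpha[i][q] * trans[(q, r)] * emit[(r, x[i + 1])] * beta[i + 1][r] / px
                exp_counts[(q, r)] = exp_counts.get((q, r), 0.0) + p
    return exp_counts, px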

52 Random Restarts

Non-convex optimization only finds a local solution. Several strategies: random restarts, simulated annealing.
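
A tiny sketch of the random-restart strategy wrapped around any EM routine; run_em and init_params are hypothetical stand-ins for a model-specific EM loop and random initializer:

import random

def em_with_restarts(data, num_restarts, run_em, init_params, seed=0):
    """Run EM from several random initializations and keep the best local optimum.
    run_em(data, params) is assumed to return (final_params, log_likelihood)."""
    best_params, best_ll = None, float("-inf")
    for k in range(num_restarts):
        rng = random.Random(seed + k)
        params, ll = run_em(data, init_params(rng))
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll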

53 Decipherment

54 Grammar Induction

55 Inductive Bias

A model can learn nothing without inductive bias... whence inductive bias? Model structure; priors (next week); posterior regularization (Google it). Features provide a very flexible means to bias a model.

56-57 EM with Features

Let's replace the multinomials with log-linear distributions:

$\gamma(q \to r) = \theta_{q,r} = \frac{\exp\, w^\top f(q, r)}{\sum_{r' \in Q} \exp\, w^\top f(q, r')}$

How will the likelihood of this model compare to the likelihood of the previous model?
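
A sketch of this featurized transition distribution as a softmax over the next state; the sparse feature function f and weight dictionary w are illustrative names, not part of the slides:

import math

def loglinear_transition(q, states, w, f):
    """Return the distribution over next states r:
    gamma(q -> r) = exp(w . f(q, r)) / sum_{r'} exp(w . f(q, r'))."""
    scores = {r: sum(w.get(k, 0.0) * v for k, v in f(q, r).items()) for r in states}
    z = sum(math.exp(s) for s in scores.values())
    return {r: math.exp(s) / z for r, s in scores.items()}

# f(q, r) returns a sparse feature vector as a dict, e.g.
# f = lambda q, r: {("pair", q, r): 1.0, ("dest", r): 1.0}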

58 Learning Algorithm 1

E step: given the model parameters, compute the posterior distribution over transitions (states, etc.), and from it compute $\sum_{q,r} E_{q(y)}[\mathrm{freq}(q \to r)]\, f(q, r)$. These are your empirical expectations.

59 Learning Algorithm 1

M step: the gradient of the expected log likelihood of $x, y$ under $q(y)$ is

$\nabla_w E_{q(y)}[\log p(x, y)] = \sum_{q,r} E_q[\mathrm{freq}(q \to r)]\, f(q, r) - \sum_{q,r} E_q[\mathrm{freq}(q)]\; p(r \mid q; w)\, f(q, r)$

Use L-BFGS or gradient descent to solve.
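
A sketch of this gradient for the transition parameters: expected feature counts from the E step minus model-expected feature counts under the current weights. exp_trans and exp_state hold the E[freq(q -> r)] and E[freq(q)] values computed earlier; f and w are as in the previous sketch:

import math

def expected_loglik_gradient(states, exp_trans, exp_state, w, f):
    """grad_w E_q[log p(x, y)] (transition part):
    sum_{q,r} E[freq(q->r)] f(q,r)  -  sum_q E[freq(q)] sum_r p(r|q; w) f(q,r)."""
    grad = {}
    for q in states:
        # model distribution p(r | q; w): softmax over next states
        scores = {r: sum(w.get(k, 0.0) * v for k, v in f(q, r).items()) for r in states}
        z = sum(math.exp(s) for s in scores.values())
        p_r = {r: math.exp(s) / z for r, s in scores.items()}
        for r in states:
            coeff = exp_trans.get((q, r), 0.0) - exp_state.get(q, 0.0) * p_r[r]
            for k, v in f(q, r).items():
                grad[k] = grad.get(k, 0.0) + coeff * v
    return grad

# Take gradient steps (or hand the negated gradient to L-BFGS) to update w,
# then re-run the E step with the new weights.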
