Expectation Maximization (EM)

The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted with training on labeled data, where all variables in the model are observed in the training data. For example, in labeled training of HMMs one assumes that one has pairs of observable sequences and the associated hidden state sequences. In unlabeled training of HMMs the training data consists of just the observation sequences. EM is a training algorithm for the unlabeled case. Learning latent variable models from unlabeled data is very challenging and has had rather poor results in practice for HMMs and PCFGs. Training has been much more successful for partially labeled data, where we have a structured label but lack an alignment between the structured input and the structured label. Automatic alignment is important for training speech recognition systems and machine translation systems. The use of EM for alignment is discussed in Section 6.

EM is perhaps easiest to understand through particular examples. The following sections give three special cases of the algorithm before presenting the general case.

1 Mixtures of Gaussians

For µ ∈ R^d and Σ a d × d covariance matrix, we let N(µ, Σ) be the multivariate Gaussian distribution with mean µ and covariance matrix Σ.

    N(\mu, \Sigma)(x) = (2\pi)^{-d/2} \, |\Sigma|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

A mixture of K Gaussians is defined by a system of parameters Θ = ⟨λ_1, µ_1, Σ_1, ..., λ_K, µ_K, Σ_K⟩ where we require that λ_1, ..., λ_K is a convex combination, i.e., λ_i ≥ 0 and ∑_{i=1}^K λ_i = 1. The probability model defined by the parameter system Θ is as follows.

    P_\Theta(x) = \sum_{i=1}^K \lambda_i \, N(\mu_i, \Sigma_i)(x)    (1)

Now suppose that we have a sample x_1, ..., x_T with x_t ∈ R^d drawn IID from some distribution D. Here we are interested in the following optimization problem.

    \Theta^* = \arg\min_\Theta \sum_{t=1}^T -\ln P_\Theta(x_t)    (2)

             = \arg\max_\Theta \prod_{t=1}^T P_\Theta(x_t)    (3)

Equation (2) defines maximum likelihood density estimation. It is maximum likelihood (ML) rather than MAP (maximum a posteriori) estimation because there is no regularization term such as λ‖Θ‖². Maximum likelihood is appropriate if T is large compared to the number of parameters being trained.
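To make (1)–(3) concrete, here is a minimal sketch in Python/NumPy that evaluates the mixture density and the negative log likelihood objective. The function names (gaussian_density, mixture_density, neg_log_likelihood) and the toy parameter values at the end are illustrative choices of mine, not notation from these notes.

    import numpy as np

    def gaussian_density(x, mu, Sigma):
        # N(mu, Sigma)(x) = (2 pi)^{-d/2} |Sigma|^{-1/2} exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
        return norm * np.exp(-0.5 * quad)

    def mixture_density(x, lambdas, mus, Sigmas):
        # Equation (1): P_Theta(x) = sum_i lambda_i N(mu_i, Sigma_i)(x)
        return sum(l * gaussian_density(x, m, S)
                   for l, m, S in zip(lambdas, mus, Sigmas))

    def neg_log_likelihood(xs, lambdas, mus, Sigmas):
        # Objective in equation (2): sum_t -ln P_Theta(x_t)
        return -sum(np.log(mixture_density(x, lambdas, mus, Sigmas)) for x in xs)

    # Toy usage with two 2-dimensional components (illustrative values).
    lambdas = [0.5, 0.5]
    mus = [np.zeros(2), np.ones(2)]
    Sigmas = [np.eye(2), np.eye(2)]
    xs = [np.array([0.1, -0.2]), np.array([0.9, 1.1])]
    print(neg_log_likelihood(xs, lambdas, mus, Sigmas))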

We can interpret (1) as derived from a latent variable y ∈ {1, ..., K} where we have the following.

    P_\Theta(y = i) = \lambda_i    (4)

Equation (1) can then be interpreted as follows.

    P_\Theta(x \mid y) = N(\mu_y, \Sigma_y)(x)    (5)

    P_\Theta(x) = \sum_{y=1}^K P_\Theta(y) \, P_\Theta(x \mid y)    (6)

                = \sum_{y=1}^K P_\Theta(x, y)    (7)

We can see that P_Θ(x, y) is a generative model (a Bayesian network). It first generates y and then generates x from y using conditional probabilities. Now suppose that we had labeled training data S = ⟨x_1, y_1⟩, ..., ⟨x_T, y_T⟩. In generative training for labeled data we set the parameter vector Θ as follows.

    \Theta^* = \arg\min_\Theta \sum_{t=1}^T -\ln P_\Theta(x_t, y_t)    (8)

The solution is then given by count ratios and empirical averages as follows.

    \lambda_i = \frac{1}{T} \sum_{t=1}^T I[y_t = i]    (9)

    \mu_i = \frac{\sum_t x_t \, I[y_t = i]}{\sum_t I[y_t = i]}    (10)

    (\Sigma_i)_{j,k} = \frac{\sum_t (x_t - \mu_i)_j \, (x_t - \mu_i)_k \, I[y_t = i]}{\sum_t I[y_t = i]}    (11)

In the EM algorithm one constructs a series of parameter values Θ^0, Θ^1, Θ^2, .... We will ignore how to construct the initial value Θ^0. In the EM algorithm one uses the distribution P_{Θ^n}(y_t | x_t) as a substitute for labels, where P_{Θ^n}(y_t | x_t) can be computed by Bayes' rule from (4) and (5). In EM for Gaussian mixtures we compute Θ^{n+1} from Θ^n by using a variant of (9), (10), and (11) in which the indicator function I[y_t = i] is replaced by the probability P_{Θ^n}(y_t = i | x_t).

    \lambda_i^{n+1} = \frac{1}{T} \sum_{t=1}^T P_{\Theta^n}(y_t = i \mid x_t)    (12)

    \mu_i^{n+1} = \frac{\sum_t x_t \, P_{\Theta^n}(y_t = i \mid x_t)}{\sum_t P_{\Theta^n}(y_t = i \mid x_t)}    (13)

    (\Sigma_i^{n+1})_{j,k} = \frac{\sum_t (x_t - \mu_i^{n+1})_j \, (x_t - \mu_i^{n+1})_k \, P_{\Theta^n}(y_t = i \mid x_t)}{\sum_t P_{\Theta^n}(y_t = i \mid x_t)}    (14)
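A single iteration of (12)–(14) can be written directly: the E-step computes the posteriors P_{Θ^n}(y_t = i | x_t) by Bayes' rule from (4) and (5), and the M-step re-estimates the parameters as weighted count ratios. The sketch below reuses gaussian_density from the previous sketch; the function name em_step and the small ridge added to each covariance for numerical stability are my own additions, not part of the notes.

    def em_step(xs, lambdas, mus, Sigmas):
        """One EM update, equations (12)-(14), for a mixture of Gaussians."""
        X = np.asarray(xs)                      # T x d data matrix
        T, d = X.shape
        K = len(lambdas)

        # E-step: responsibilities r[t, i] = P_{Theta^n}(y_t = i | x_t) by Bayes' rule.
        r = np.array([[lambdas[i] * gaussian_density(X[t], mus[i], Sigmas[i])
                       for i in range(K)] for t in range(T)])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted count ratios.
        new_lambdas = r.sum(axis=0) / T                          # equation (12)
        new_mus, new_Sigmas = [], []
        for i in range(K):
            w = r[:, i]
            mu_i = (w[:, None] * X).sum(axis=0) / w.sum()        # equation (13)
            diff = X - mu_i
            Sigma_i = (w[:, None, None] *
                       np.einsum('tj,tk->tjk', diff, diff)).sum(axis=0) / w.sum()  # equation (14)
            Sigma_i += 1e-6 * np.eye(d)                          # tiny ridge for stability (my addition)
            new_mus.append(mu_i)
            new_Sigmas.append(Sigma_i)
        return list(new_lambdas), new_mus, new_Sigmas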

2 EM for HMMs

In HMMs we have that x_t and y_t are both sequences, x_t^1, ..., x_t^{N_t} and y_t^1, ..., y_t^{N_t}. Here we have x_t^i ∈ O, where O is a set of possible observations, and y_t^i ∈ S, where S is a set of possible hidden states. For s ∈ S we have a parameter Π_s specifying P(y_t^1 = s). For s, w ∈ S we have a parameter T_{s,w} specifying P(y_t^{i+1} = w | y_t^i = s). Finally, for s ∈ S and u ∈ O we have a parameter O_{s,u} specifying P(x_t^i = u | y_t^i = s). This gives the following joint probability for a sequence pair of length N.

    P_\Theta(x, y) = \Pi_{y^1} \left( \prod_{i=1}^{N-1} T_{y^i, y^{i+1}} \right) \left( \prod_{i=1}^{N} O_{y^i, x^i} \right)    (15)

Now suppose we had labeled training data ⟨x_1, y_1⟩, ..., ⟨x_T, y_T⟩. For HMMs the generative training equation (8) has a closed-form solution as follows.

    \Pi_s = \frac{1}{T} \sum_{t=1}^T I[y_t^1 = s]    (16)

    T_{s,w} = \frac{\sum_t \sum_{i=1}^{N_t - 1} I[y_t^i = s \wedge y_t^{i+1} = w]}{\sum_t \sum_{i=1}^{N_t - 1} I[y_t^i = s]}    (17)

    O_{s,u} = \frac{\sum_t \sum_{i=1}^{N_t} I[y_t^i = s \wedge x_t^i = u]}{\sum_t \sum_{i=1}^{N_t} I[y_t^i = s]}    (18)

Again each component of Θ is set to a count ratio. To convert labeled generative training to the EM algorithm we again replace the indicator functions by probabilities as follows.

    \Pi_s^{n+1} = \frac{1}{T} \sum_{t=1}^T P_{\Theta^n}(y_t^1 = s \mid x_t)    (19)

    T_{s,w}^{n+1} = \frac{\sum_t \sum_{i=1}^{N_t - 1} P_{\Theta^n}(y_t^i = s \wedge y_t^{i+1} = w \mid x_t)}{\sum_t \sum_{i=1}^{N_t - 1} P_{\Theta^n}(y_t^i = s \mid x_t)}    (20)

    O_{s,u}^{n+1} = \frac{\sum_t \sum_{i=1}^{N_t} P_{\Theta^n}(y_t^i = s \wedge x_t^i = u \mid x_t)}{\sum_t \sum_{i=1}^{N_t} P_{\Theta^n}(y_t^i = s \mid x_t)}    (21)

The probabilities appearing in equations (19), (20), and (21) can be computed with the forward-backward procedure.
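The per-sequence posteriors needed in (19)–(21) come from the forward-backward recursions. Below is a minimal sketch (Python/NumPy) for a single observation sequence; it returns the state posteriors P(y^i = s | x) and the pairwise posteriors P(y^i = s ∧ y^{i+1} = w | x). The unscaled recursions shown here underflow on long sequences and would need rescaling or log-space arithmetic in practice; the variable names (Pi, T_mat, O_mat) are my own.

    def forward_backward(obs, Pi, T_mat, O_mat):
        """obs: list of observation indices x^1..x^N.
        Pi (S,), T_mat (S, S), O_mat (S, U) are NumPy arrays as in equation (15).
        Returns (gamma, xi) with
          gamma[i, s]  = P(y^i = s | x)
          xi[i, s, w]  = P(y^i = s, y^{i+1} = w | x)."""
        N, S = len(obs), len(Pi)
        alpha = np.zeros((N, S))     # alpha[i, s] = P(x^1..x^i, y^i = s)
        beta = np.zeros((N, S))      # beta[i, s]  = P(x^{i+1}..x^N | y^i = s)

        alpha[0] = Pi * O_mat[:, obs[0]]
        for i in range(1, N):
            alpha[i] = (alpha[i - 1] @ T_mat) * O_mat[:, obs[i]]

        beta[N - 1] = 1.0
        for i in range(N - 2, -1, -1):
            beta[i] = T_mat @ (O_mat[:, obs[i + 1]] * beta[i + 1])

        Z = alpha[N - 1].sum()                   # P(x)
        gamma = alpha * beta / Z
        xi = np.zeros((N - 1, S, S))
        for i in range(N - 1):
            xi[i] = (alpha[i][:, None] * T_mat *
                     (O_mat[:, obs[i + 1]] * beta[i + 1])[None, :]) / Z
        return gamma, xi

The EM updates (19)–(21) then sum these posteriors over all training sequences and normalize, exactly as in the count-ratio formulas above.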

3 EM for PCFGs

For probabilistic context-free grammars (PCFGs) we have that x_t is a string and y_t is a parse tree whose yield (sequence of leaf nodes) is x_t. We assume that y_t must use only productions from a fixed (finite) set of productions G in Chomsky normal form, i.e., every production is either of the form X → Y Z, where X, Y, and Z are nonterminal symbols, or of the form X → a, where a is a terminal symbol. For each production X → α ∈ G we assume a probability P(X → α) satisfying the following normalization constraint at each nonterminal X.

    \sum_{\alpha:\, X \to \alpha \in G} P(X \to \alpha) = 1

We let Θ be the set of all probabilities of the form P(X → α) and we define P_Θ(x_t, y_t) as follows.

    P_\Theta(x_t, y_t) = \begin{cases} 0 & \text{if } x_t \text{ is not the yield of } y_t \\ \prod_{X \to \alpha \in y_t} P(X \to \alpha) & \text{otherwise} \end{cases}

In the above product over productions we allow a given production to occur more than once. We write #(X → α, y_t) for the number of times that the production X → α occurs in the tree y_t. We let #(X, y_t) be the number of nodes in the tree y_t labeled with the nonterminal X. For PCFGs the generative training equation (8) has a closed-form solution given as follows.

    P(X \to \alpha) = \frac{\sum_t \#(X \to \alpha, y_t)}{\sum_t \#(X, y_t)}    (22)

Again, in generative training each component of Θ is set to a count ratio. We convert generative training to an EM update by replacing the counts with expected counts (this corresponds to replacing indicator functions by probabilities).

    P^{n+1}(X \to \alpha) = \frac{\sum_t E_{y \sim P_{\Theta^n}(\cdot \mid x_t)} \, \#(X \to \alpha, y)}{\sum_t E_{y \sim P_{\Theta^n}(\cdot \mid x_t)} \, \#(X, y)}    (23)

The expectations appearing in (23) can be computed using the inside-outside algorithm.
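For labeled trees, the tree probability and the counts in (22) are simple recursions over the tree. In the sketch below (Python) a parse tree is encoded as a nested tuple, (X, left, right) for a binary production or (X, a) for a lexical one; this encoding and the function names are illustrative assumptions of mine, not notation from these notes.

    from collections import Counter

    def productions(tree):
        """Yield the productions used in a parse tree, with repetition.
        A tree is (X, left, right) for X -> Y Z or (X, a) for X -> a."""
        if len(tree) == 2:                       # X -> a
            yield (tree[0], (tree[1],))
        else:                                    # X -> Y Z
            X, left, right = tree
            yield (X, (left[0], right[0]))
            yield from productions(left)
            yield from productions(right)

    def tree_probability(tree, rule_prob):
        """P_Theta(x_t, y_t) for a tree whose yield is x_t: product over its productions."""
        p = 1.0
        for rule in productions(tree):
            p *= rule_prob[rule]                 # P(X -> alpha)
        return p

    def relative_frequency_estimate(trees):
        """Equation (22): P(X -> alpha) = count(X -> alpha) / count(X)."""
        rule_counts, nonterminal_counts = Counter(), Counter()
        for tree in trees:
            for X, alpha in productions(tree):
                rule_counts[(X, alpha)] += 1
                nonterminal_counts[X] += 1
        return {rule: c / nonterminal_counts[rule[0]] for rule, c in rule_counts.items()}

The EM update (23) replaces these observed counts with expected counts under P_{Θ^n}(y | x_t), which the inside-outside algorithm computes without enumerating parse trees.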

4 General EM

In general EM we assume arbitrary sets X and Y (for now we will assume these sets are discrete). We assume a distribution on X × Y parameterized by a parameter vector Θ, i.e., for a given value of Θ, and x ∈ X and y ∈ Y, we have a probability P_Θ(x, y). We are given a sample x_1, ..., x_T with x_t ∈ X and we want to train the model using the following.

    \Theta^* = \arg\min_\Theta \sum_{t=1}^T -\ln P_\Theta(x_t)    (24)

    P_\Theta(x) = \sum_{y \in Y} P_\Theta(x, y)    (25)

EM is an iterative algorithm that computes a sequence of parameter vectors Θ^0, Θ^1, Θ^2, .... We will ignore the question of how to set Θ^0 and concentrate instead on how to compute Θ^{n+1} from Θ^n. The basic idea is that Θ^n defines a probability distribution P_{Θ^n}(y | x). This distribution is treated as implicit labels. If we had a labeled data set ⟨x_1, y_1⟩, ..., ⟨x_T, y_T⟩ then generative training would set Θ using the following.

    \Theta^* = \arg\min_\Theta \sum_{t=1}^T -\ln P_\Theta(x_t, y_t)    (26)

In EM the following variant of this equation is used to compute Θ^{n+1} from Θ^n.

    \Theta^{n+1} = \arg\min_\Theta \sum_{t=1}^T E_{y \sim P_{\Theta^n}(\cdot \mid x_t)} \left[ -\ln P_\Theta(x_t, y) \right]    (27)

In each of the three examples given in the previous sections, (26) has a closed-form solution in which each parameter is given by a count ratio. In such cases equation (27) also has a closed-form solution in which each parameter is given by a ratio of expectations or probabilities. In each of the above examples only the closed-form solution for Θ^{n+1} was given, but the general EM algorithm is defined by (27). It is important to keep in mind that while (26) and (27) are often easily solved, equation (24) is much more challenging. For models in the exponential family (MRFs, CRFs) we have that (26) and (27) are convex in Θ but (24) is not. The EM algorithm often converges in practice to a local optimum with a value significantly larger than the global optimum of (24).
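Equation (27) suggests a generic implementation pattern: an E-step that computes the posterior P_{Θ^n}(y | x_t) for each example, and an M-step that solves (27), which for count-ratio models is just normalization of expected counts. The Python skeleton below is a sketch of that structure only; the interface (e_step, m_step, log_marginal) and the convergence test on the objective (24) are assumptions of mine, not a prescription from the notes.

    def em(xs, theta0, e_step, m_step, log_marginal, max_iters=100, tol=1e-6):
        """Generic EM driver for equation (27).

        e_step(theta, x)       -> posterior over latent y given x (e.g. a dict or array)
        m_step(xs, posteriors) -> new theta solving (27) for these expected 'labels'
        log_marginal(theta, x) -> ln P_theta(x), used only to monitor objective (24)."""
        theta = theta0
        prev_obj = float('inf')
        for _ in range(max_iters):
            posteriors = [e_step(theta, x) for x in xs]      # E-step
            theta = m_step(xs, posteriors)                   # M-step, equation (27)
            obj = -sum(log_marginal(theta, x) for x in xs)   # objective (24); should not increase
            if prev_obj - obj < tol:
                break
            prev_obj = obj
        return theta

For the mixture of Gaussians, e_step is the Bayes' rule computation of P_{Θ^n}(y_t = i | x_t) and m_step implements (12)–(14).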

5 The EM Theorem

The EM theorem states, essentially, that the EM update makes progress in optimizing (24) on every step. Before proving the theorem we note that, without loss of generality, we can consider just a single pair of variables X and Y with a model P_Θ(X, Y). This does not lose generality because we can take X to be the sequence x_1, ..., x_T and Y to be the sequence y_1, ..., y_T. We can now state the EM theorem as follows, writing Θ for the current parameters and Θ' for the updated parameters.

EM Theorem: If

    E_{Y \sim S} \left[ \ln P_{\Theta'}(X, Y) \right] > E_{Y \sim S} \left[ \ln P_\Theta(X, Y) \right]    (28)

where

    S = P_\Theta(\cdot \mid X)    (29)

then

    P_{\Theta'}(X) > P_\Theta(X)    (30)

Proof: Starting from (28) we have the following chain.

    E_{Y \sim S} \left[ \ln \frac{P_{\Theta'}(X, Y)}{P_\Theta(X, Y)} \right] > 0

    E_{Y \sim S} \left[ \ln \frac{P_{\Theta'}(X) \, P_{\Theta'}(Y \mid X)}{P_\Theta(X) \, P_\Theta(Y \mid X)} \right] > 0

    \ln \frac{P_{\Theta'}(X)}{P_\Theta(X)} + E_{Y \sim S} \left[ \ln \frac{P_{\Theta'}(Y \mid X)}{P_\Theta(Y \mid X)} \right] > 0

The first term does not depend on Y and so comes out of the expectation. Rearranging, and using the fact that S = P_Θ(· | X) so the remaining expectation is a KL divergence, which is nonnegative, we get

    \ln \frac{P_{\Theta'}(X)}{P_\Theta(X)} > E_{Y \sim S} \left[ \ln \frac{P_\Theta(Y \mid X)}{P_{\Theta'}(Y \mid X)} \right] = KL(S, \, P_{\Theta'}(\cdot \mid X)) \geq 0

and therefore

    P_{\Theta'}(X) > P_\Theta(X).
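As a purely numerical sanity check on the theorem (not a substitute for the proof), one can verify that a single em_step from the Gaussian-mixture sketch above does not increase the objective (24). The data and starting parameters below are arbitrary toy values of mine.

    # Toy check: one EM update should not increase the objective (24).
    rng = np.random.default_rng(0)
    xs = [rng.normal(loc=0.0, size=2) for _ in range(10)] + \
         [rng.normal(loc=3.0, size=2) for _ in range(10)]

    lambdas = [0.5, 0.5]
    mus = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    Sigmas = [np.eye(2), np.eye(2)]

    before = neg_log_likelihood(xs, lambdas, mus, Sigmas)
    lambdas, mus, Sigmas = em_step(xs, lambdas, mus, Sigmas)
    after = neg_log_likelihood(xs, lambdas, mus, Sigmas)
    print(before, after)   # 'after' should be no larger than 'before'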

6 EM for Alignment

EM does not work well in practice for unlabeled training of HMMs or PCFGs. However, EM is useful in practice for solving alignment problems that arise in partially labeled data. Speech recognition systems are trained with pairs of speech signals and corresponding sentences (word sequences). Although the whole speech signal is labeled with the whole sentence, the sentence and the signal are not aligned. The correspondence between a particular time in the speech signal and a particular place in the sentence is determined automatically. This is done by running EM on a large corpus of such pairs, treating the alignment as the latent variable.

EM is used similarly for alignment in translation between natural languages. In this case the training data consists of translation pairs, each of which is a pair of a sentence in the source language (say English) and its translation into the target language (say French). Although each source language sentence is labeled with its translation, the alignment is not specified: there is no explicit labeling of which words or phrases in the source correspond to which words or phrases in the target. In practice the alignment is found automatically by running EM on a large corpus of training pairs, treating the alignment as the latent variable.

7 Problems

1. Suppose we initialize the PCFG to be deterministic, in the sense that for any word string there is at most one parse of that string with nonzero probability, and suppose that each x_t does have a parse under the initial grammar. How rapidly does EM converge in this case? What does it converge to? Justify your answer.