Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri, Courant Institute and Google Research, mohri@cims.nyu.com

This Lecture: Expectation-Maximization (EM) algorithm; Hidden Markov Models.

Latent Variables. Definition: unobserved or hidden variables, as opposed to those directly available at training and test time. Examples: mixture models, HMMs. Why latent variables? Naturally unavailable variables, e.g., economic variables; modeling: latent variables are introduced to model dependencies.

ML with Latent Variables. Problem: with fully observed variables, ML is often straightforward; with latent variables, the log-likelihood contains a sum inside the logarithm, which makes finding the best parameter values harder:
$$L(\theta, x) = \log p_\theta[x] = \log \sum_z p_\theta[x, z] = \log \sum_z p_\theta[z \mid x]\, p_\theta[x].$$
Idea: use the current parameter values to estimate the latent variables, then use those estimates to re-estimate the parameter values (EM algorithm).

EM Theorem. Theorem: for any $x$ and any parameter values $\theta$ and $\theta'$,
$$L(\theta', x) - L(\theta, x) \geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z].$$
Proof:
$$\begin{aligned}
L(\theta', x) - L(\theta, x) &= \log p_{\theta'}[x] - \log p_\theta[x] \\
&= \sum_z p_\theta[z \mid x] \log p_{\theta'}[x] - \sum_z p_\theta[z \mid x] \log p_\theta[x] \\
&= \sum_z p_\theta[z \mid x] \log \frac{p_{\theta'}[x, z]}{p_{\theta'}[z \mid x]} - \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]} \\
&= \sum_z p_\theta[z \mid x] \log \frac{p_\theta[z \mid x]}{p_{\theta'}[z \mid x]} + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z] \\
&= D\big(p_\theta[z \mid x] \,\big\|\, p_{\theta'}[z \mid x]\big) + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z] \\
&\geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z],
\end{aligned}$$
since the relative entropy $D(\cdot \,\|\, \cdot)$ is non-negative.

EM Idea. Maximize the expectation
$$\sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] = \mathop{\mathrm{E}}_{z \sim p_\theta[\cdot \mid x]}\big[\log p_{\theta'}[x, z]\big].$$
Iterations: compute $p_\theta[z \mid x]$ for the current value of $\theta$; then compute the expectation and find the maximizing $\theta'$.

EM Algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983). Algorithm: maximum-likelihood meta-algorithm for models with latent variables.
E-step: $q_{t+1} \leftarrow p_{\theta_t}[z \mid x]$.
M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta \sum_z q_{t+1}(z \mid x) \log p_\theta[x, z]$.
Interpretation: E-step: posterior probability of the latent variables given the observed variables and the current parameter; M-step: maximum-likelihood parameter given all data.

EM Algorithm - Notes. Applications: mixture models, e.g., Gaussian mixtures (latent variables: which component generated each point); HMMs (Baum-Welch algorithm). Notes: positive: each iteration increases the likelihood, and no parameter tuning is required; negative: the algorithm can converge to a local optimum, the likelihood may converge while the parameters do not, and the result depends on the initial parameter values.
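To make the E-step/M-step loop concrete, here is a minimal sketch (not from the slides) of EM for a one-dimensional, two-component Gaussian mixture; the function name em_gmm_1d, the synthetic data, and all numeric choices are assumptions made for this illustration.

import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Minimal EM sketch for a 1-D mixture of two Gaussians (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialize mixture weights, means, and variances (arbitrary starting point).
    w = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = np.stack([
            w[k] * np.exp(-(x - mu[k])**2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
            for k in range(2)
        ])                                   # shape (2, n)
        post = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from the expected (soft) assignments.
        nk = post.sum(axis=1)
        w = nk / nk.sum()
        mu = (post * x).sum(axis=1) / nk
        var = (post * (x - mu[:, None])**2).sum(axis=1) / nk
    return w, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
    print(em_gmm_1d(data))

On this synthetic data the estimated means should approach the generating values (about -2 and 3), illustrating the monotone increase of the likelihood noted above.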

EM = Coordinate Ascent. Theorem: EM coincides with coordinate ascent applied to
$$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} = \sum_z q[z \mid x] \log p_\theta[x, z] - \sum_z q[z \mid x] \log q[z \mid x].$$
E-step: $q_{t+1} \leftarrow \operatorname{argmax}_q A(q, \theta_t)$. M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta A(q_{t+1}, \theta)$.
Proof: the E-step follows from the two identities
$$A(q, \theta) \leq \log \sum_z q[z \mid x]\, \frac{p_\theta[x, z]}{q[z \mid x]} = \log \sum_z p_\theta[x, z] = \log p_\theta[x] = L(\theta, x),$$
$$A(p_\theta[z \mid x], \theta) = \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]} = \sum_z p_\theta[z \mid x] \log p_\theta[x] = \log p_\theta[x] = L(\theta, x),$$
where the first inequality follows from Jensen's inequality (concavity of the logarithm).

EM = Coordinate Ascent. Proof (cont.): finally, note that
$$\operatorname{argmax}_\theta A(q_{t+1}, \theta) = \operatorname{argmax}_\theta \sum_z q_{t+1}[z \mid x] \log p_\theta[x, z],$$
which coincides with the M-step of EM.

EM - Extensions and Variants (Dempster et al., 1977; Jamshidian and Jennrich, 1993; Wu, 1983). Generalized EM (GEM): full maximization is not necessary at every step, since any (strict) increase of the auxiliary function guarantees an increase of the log-likelihood. Sparse EM: posterior probabilities are computed only at some points in the E-step.

This Lecture: Expectation-Maximization (EM) algorithm; Hidden Markov Models.

Motivation. Data: sample of $m$ sequences over an alphabet $\Sigma$, drawn i.i.d. according to some distribution $D$: $x^i_1, x^i_2, \ldots, x^i_{k_i}$, $i = 1, \ldots, m$. Problem: find a sequence model that best estimates the distribution $D$. Latent variables: the observed sequences may have been generated by a model with states and transitions that are hidden or unobserved.

Example. Observations: sequences of Heads and Tails. Observed sequence: H, T, T, H, T, H, T, H, H, H, T, T, ..., H. Unobserved paths: (0, H, 0), (0, T, 1), (1, T, 1), ..., (2, H, 2). Model: (figure: a three-state HMM with states 0, 1, 2 and transitions labeled by emission/probability pairs such as T/.4, H/.1, H/.3, T/.7, H/.6, T/.5).

Hidden Markov Models. Definition: probabilistic automata, generative view. Discrete case: finite alphabet $\Sigma$. Transition probability: $\Pr[\text{transition } e]$; emission probability: $\Pr[\text{emission } a \mid e]$. Simplification: for us, each transition $e$ with label $a$ carries the single weight $w[e] = \Pr[a \mid e]\, \Pr[e]$. Illustration: (figure: a small automaton with transitions labeled a/.3, a/.2, a/.2, b/.1, c/.2 and a final weight /.5).
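As a small illustration of the combined weight $w[e] = \Pr[a \mid e]\, \Pr[e]$, the following sketch stores a discrete HMM as NumPy arrays and prints the weight of every labeled transition; the two-state model, its alphabet, and all probability values are made-up assumptions, not the automaton in the figure.

import numpy as np

# Illustrative 2-state HMM over the alphabet {"a", "b"} (all values are made up).
states = [0, 1]
symbols = ["a", "b"]

# trans[q, q2] = Pr[transition q -> q2]
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])

# emit[q, q2, s] = Pr[emission of symbol s | transition q -> q2]
emit = np.array([[[0.9, 0.1], [0.5, 0.5]],
                 [[0.2, 0.8], [0.6, 0.4]]])

# Combined weight w[e] = Pr[a | e] * Pr[e] for every labeled transition e.
w = emit * trans[:, :, None]

for q in states:
    for q2 in states:
        for s, sym in enumerate(symbols):
            print(f"w[{q} -{sym}-> {q2}] = {w[q, q2, s]:.3f}")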

Other Models. Outputs at states (instead of transitions): equivalent model, dual representation. Non-discrete case: outputs in $\mathbb{R}^N$; fixed probabilities replaced by distributions, e.g., Gaussian mixtures; application to acoustic modeling.

Three Problems (Rabiner, 1989). Problem 1: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, compute $p_\theta[x_1, \ldots, x_k]$. Problem 2: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, find the most likely path $\pi = e_1, \ldots, e_r$ that generated that sequence. Problem 3: estimate the parameters $\theta$ of a hidden Markov model via ML: $\hat\theta = \operatorname{argmax}_\theta p_\theta[x]$.

P1: Evaluation. Algorithm: let $X$ be a finite automaton representing the sequence $x = x_1 \cdots x_k$ and let $H$ denote the current hidden Markov model. Then,
$$p_\theta[x] = \sum_{\pi \in X \circ H} w[\pi].$$
Thus, it can be computed using composition and a shortest-distance algorithm, in time $O(|X \circ H|)$.
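The slide evaluates $p_\theta[x]$ by composition and shortest distance; as a plain point of comparison (an assumption, not the weighted-automata implementation used in the lecture), here is the standard forward recursion for an HMM written in the equivalent state-output representation, with illustrative numbers.

import numpy as np

def forward_prob(obs, init, trans, emit):
    """Forward algorithm: returns p[x_1..x_k] by summing over all hidden paths.

    obs   : list of observation indices
    init  : (n_states,) initial state probabilities
    trans : (n_states, n_states) transition probabilities
    emit  : (n_states, n_symbols) emission probabilities (outputs at states)
    """
    alpha = init * emit[:, obs[0]]          # alpha_1(q) = init(q) * emit(q, x_1)
    for o in obs[1:]:
        # alpha_i(q') = sum_q alpha_{i-1}(q) * trans(q, q') * emit(q', x_i)
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

if __name__ == "__main__":
    # Illustrative 2-state HMM over {H=0, T=1}; all numbers are made up.
    init = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(forward_prob([0, 1, 1, 0], init, trans, emit))

The returned value sums over all hidden paths, i.e., the same quantity as $\sum_{\pi \in X \circ H} w[\pi]$ above.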

P2: Most Likely Path. Algorithm: shortest-path problem in the $(\max, \times)$ semiring; any shortest-path algorithm can be applied to the result of the composition $X \circ H$. In particular, since that automaton is acyclic, with a linear-time shortest-path algorithm the total complexity is in $O(|X \circ H|) = O(|X|\,|H|)$. A traditional solution is to use the Viterbi algorithm.
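In the same spirit, here is a minimal sketch of the Viterbi algorithm mentioned above, again in the state-output representation with made-up numbers; the function name and model values are assumptions.

import numpy as np

def viterbi(obs, init, trans, emit):
    """Viterbi algorithm: most likely hidden state sequence for the observations."""
    # delta[q]: log-probability of the best path ending in state q so far.
    delta = np.log(init) + np.log(emit[:, obs[0]])
    back = []                                # backpointers for path recovery
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans) + np.log(emit[:, o])[None, :]
        back.append(scores.argmax(axis=0))   # best predecessor for each state
        delta = scores.max(axis=0)
    # Trace back from the best final state.
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(delta.max())

if __name__ == "__main__":
    init = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(viterbi([0, 1, 1, 0], init, trans, emit))

The returned state sequence plays the role of the most likely path $\pi$ found by the shortest-path formulation.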

P3: Parameter Estimation (Baum, 1972). Baum-Welch algorithm: special case of EM applied to HMMs. Here, $\theta$ represents the transition weights $w[e]$, and the latent or hidden variables are the paths $\pi$. M-step:
$$\operatorname{argmax}_{\theta'} \sum_\pi p_\theta[\pi \mid x] \log p_{\theta'}[x, \pi] \quad \text{subject to} \quad \forall q \in Q, \; \sum_{e \in E[q]} w_{\theta'}[e] = 1.$$

P3: Parameter Estimation. Using Lagrange multipliers $\lambda_q$, $q \in Q$, the problem consists of setting the following partial derivatives to zero:
$$\frac{\partial}{\partial w_{\theta'}[e]} \Big[ \sum_\pi p_\theta[\pi \mid x] \log p_{\theta'}[x, \pi] - \sum_q \lambda_q \sum_{e \in E[q]} w_{\theta'}[e] \Big] = 0
\;\Longleftrightarrow\; \sum_\pi p_\theta[\pi \mid x]\, \frac{\partial \log p_{\theta'}[x, \pi]}{\partial w_{\theta'}[e]} - \lambda_{\operatorname{orig}[e]} = 0.$$
Note that $p_{\theta'}[x, \pi] \neq 0$ only for paths $\pi$ labeled with $x$: $\pi \in \Pi(x)$. In that case, $p_{\theta'}[x, \pi] = \prod_e w_{\theta'}[e]^{|\pi|_e}$, where $|\pi|_e$ is the number of occurrences of $e$ in $\pi$.

P3: Parameter Estimation. Thus, the equation with partial derivatives can be rewritten as
$$\sum_{\pi \in \Pi(x)} p_\theta[\pi \mid x]\, \frac{|\pi|_e}{w_{\theta'}[e]} = \lambda_{\operatorname{orig}(e)}
\;\Longleftrightarrow\; w_{\theta'}[e] = \frac{1}{\lambda_{\operatorname{orig}(e)}} \sum_{\pi \in \Pi(x)} p_\theta[\pi \mid x]\, |\pi|_e = \frac{1}{\lambda_{\operatorname{orig}(e)}\, p_\theta[x]} \sum_{\pi \in \Pi(x)} p_\theta[x, \pi]\, |\pi|_e.$$

P3: Parameter Estimation. But,
$$\sum_{\pi \in \Pi(x)} p_\theta[x, \pi]\, |\pi|_e = \sum_{i=1}^{k} \alpha_{i-1}(\theta)\, w_\theta[e]\, \beta_i(\theta).$$
Thus,
$$w_{\theta'}[e] = \frac{1}{\lambda_{\operatorname{orig}(e)}\, p_\theta[x]} \sum_{i=1}^{k} \alpha_{i-1}(\theta)\, w_\theta[e]\, \beta_i(\theta),
\quad \text{with} \quad
\alpha_i(\theta) = p_\theta[x_1, \ldots, x_i,\, q = \operatorname{orig}[e]], \quad
\beta_i(\theta) = p_\theta[x_i, \ldots, x_k \mid \operatorname{dest}[e]].$$
Divided by $p_\theta[x]$, this sum is the expected number of times through transition $e$. It can be computed using forward-backward algorithms, or, for us, shortest-distance algorithms.
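As a rough illustration of these expected counts (a sketch in the state-output representation with made-up numbers, not the shortest-distance formulation of the slides), the following forward-backward pass accumulates, for every transition, the analogue of $\alpha_{i-1}\, w[e]\, \beta_i / p_\theta[x]$; here the transition weight factors into a transition and an emission probability.

import numpy as np

def expected_transition_counts(obs, init, trans, emit):
    """Forward-backward pass: expected number of uses of each transition q -> q'."""
    n, k = len(init), len(obs)
    # Forward: alpha[i, q] = p[x_1..x_{i+1}, state_i = q]
    alpha = np.zeros((k, n))
    alpha[0] = init * emit[:, obs[0]]
    for i in range(1, k):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
    # Backward: beta[i, q] = p[x_{i+2}..x_k | state_i = q]
    beta = np.ones((k, n))
    for i in range(k - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
    px = alpha[-1].sum()                     # total sequence probability p_theta[x]
    # Expected counts: sum over positions of alpha * weight * beta, normalized by p[x].
    counts = np.zeros((n, n))
    for i in range(k - 1):
        counts += np.outer(alpha[i], emit[:, obs[i + 1]] * beta[i + 1]) * trans / px
    return counts

if __name__ == "__main__":
    init = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(expected_transition_counts([0, 1, 1, 0], init, trans, emit))

Normalizing these counts over the outgoing transitions of each source state gives the Baum-Welch re-estimate of the transition weights.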

References

L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, pp. 1-38, 1977.

M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, Vol. 88, No. 421, pp. 221-228, 1993.

Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

J. A. O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.

C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1, pp. 95-103, 1983.