Weighted Finite-State Transducers in Computational Biology


Weighted Finite-State Transducers in Computational Biology
Mehryar Mohri, Courant Institute of Mathematical Sciences, mohri@cims.nyu.edu
Joint work with Corinna Cortes (Google Research).

This Tutorial
Weighted finite-state transducers and software library
Pairwise alignments for bioinformatics
Learning weighted transducers

This Lecture
Expectation-Maximization (EM) algorithm
Hidden Markov Models

Problem
Data: sample x_1, ..., x_m ∈ X drawn i.i.d. from a set X according to some distribution D.
Problem: find a distribution p out of a set P that best estimates D.

Maximum Likelihood
Likelihood: probability of observing the sample under a distribution p ∈ P, which, given the independence assumption, is
Pr[x_1, ..., x_m] = ∏_{i=1}^m p(x_i).
Principle: select the distribution maximizing the sample probability,
p* = argmax_{p ∈ P} ∏_{i=1}^m p(x_i), or equivalently p* = argmax_{p ∈ P} Σ_{i=1}^m log p(x_i).

Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
H, T, T, H, T, H, T, H, H, H, T, T, ..., H.
Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.
Likelihood: l(p) = log [θ^{N(H)} (1 − θ)^{N(T)}] = N(H) log θ + N(T) log(1 − θ).
Solution: l is differentiable and concave;
dl(p)/dθ = N(H)/θ − N(T)/(1 − θ) = 0 ⟹ θ = N(H) / (N(H) + N(T)).
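A minimal Python sketch of this closed-form estimate, assuming a toy flip sequence (the data and function name are illustrative, not from the slides):

```python
from collections import Counter

def bernoulli_mle(flips):
    """Closed-form ML estimate of theta = p(H) from a list of 'H'/'T' outcomes."""
    counts = Counter(flips)
    return counts["H"] / (counts["H"] + counts["T"])

# toy sequence in the spirit of the slide
flips = list("HTTHTHTHHHTTH")
print(bernoulli_mle(flips))  # N(H)/(N(H)+N(T)) = 7/13 ≈ 0.538
```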

Latent Variables
Definition: unobserved or hidden variables, as opposed to those directly available at training and test time.
Examples: mixture models, HMMs.
Why latent variables?
Naturally unavailable variables: e.g., economic variables.
Modeling: latent variables introduced to model dependencies.

ML with Latent Variables
Problem: with fully observed variables, ML is often straightforward, but with latent variables the log-likelihood contains a sum inside the logarithm, which makes it harder to determine the best parameter values:
L(θ, x) = log p_θ[x] = log Σ_z p_θ[x, z] = log Σ_z p_θ[x | z] p_θ[z].
Idea: use the current parameter values to estimate the latent variables, then use those estimates to re-estimate the parameter values. This leads to the EM algorithm.

EM Theorem
Theorem: for any x and any parameter values θ and θ′,
L(θ′, x) − L(θ, x) ≥ Σ_z p_θ[z|x] log p_θ′[x, z] − Σ_z p_θ[z|x] log p_θ[x, z].
Proof:
L(θ′, x) − L(θ, x) = log p_θ′[x] − log p_θ[x]
= Σ_z p_θ[z|x] log p_θ′[x] − Σ_z p_θ[z|x] log p_θ[x]
= Σ_z p_θ[z|x] log (p_θ′[x, z] / p_θ′[z|x]) − Σ_z p_θ[z|x] log (p_θ[x, z] / p_θ[z|x])
= Σ_z p_θ[z|x] log (p_θ[z|x] / p_θ′[z|x]) + Σ_z p_θ[z|x] log p_θ′[x, z] − Σ_z p_θ[z|x] log p_θ[x, z]
= D(p_θ[z|x] ‖ p_θ′[z|x]) + Σ_z p_θ[z|x] log p_θ′[x, z] − Σ_z p_θ[z|x] log p_θ[x, z]
≥ Σ_z p_θ[z|x] log p_θ′[x, z] − Σ_z p_θ[z|x] log p_θ[x, z],
since the relative entropy D(· ‖ ·) is non-negative.

EM Algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983)
Algorithm: maximum-likelihood meta-algorithm for models with latent variables.
E-step: q_{t+1} ← p_{θ_t}[z | x].
M-step: θ_{t+1} ← argmax_θ Σ_z q_{t+1}(z | x) log p_θ[x, z].
Interpretation:
E-step: posterior probability of the latent variables given the observed variables and the current parameter values.
M-step: maximum-likelihood parameter given all the data.
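To make the E- and M-steps concrete, here is a minimal Python sketch of EM for a toy model not covered in the slides: a mixture of two biased coins, where the latent variable z indicates which coin produced each group of flips. All parameter names and data are illustrative assumptions.

```python
import numpy as np

def em_two_coins(trials, n_iter=50, theta=(0.6, 0.5), prior=(0.5, 0.5)):
    """EM for a mixture of two biased coins; z = which coin produced a trial."""
    theta = np.array(theta, dtype=float)    # coin biases Pr[H | z = k]
    prior = np.array(prior, dtype=float)    # mixing weights Pr[z = k]
    counts = np.array(trials, dtype=float)  # shape (m, 2): (heads, tails) per trial

    for _ in range(n_iter):
        # E-step: posterior q(z = k | trial i) under the current parameters
        log_joint = (np.log(prior)
                     + counts[:, [0]] * np.log(theta)
                     + counts[:, [1]] * np.log(1.0 - theta))
        q = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)

        # M-step: maximum-likelihood parameters from the expected counts
        exp_heads = q.T @ counts[:, 0]
        exp_tails = q.T @ counts[:, 1]
        theta = exp_heads / (exp_heads + exp_tails)
        prior = q.mean(axis=0)

    return theta, prior

# toy data: (heads, tails) counts for five groups of ten flips (illustrative)
print(em_two_coins([(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]))
```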

EM Algorithm - Notes
Applications: mixture models, e.g., Gaussian mixtures (latent variables: which component generated each point); HMMs (Baum-Welch algorithm).
Notes:
Positive: each iteration increases the likelihood; no parameter tuning.
Negative: can converge to a local maximum of the likelihood; the likelihood may converge while the parameters do not; dependence on the initial parameter values.

EM = Coordinate Ascent
Theorem: EM = coordinate ascent applied to
A(q, θ) = Σ_z q[z|x] log (p_θ[x, z] / q[z|x]) = Σ_z q[z|x] log p_θ[x, z] − Σ_z q[z|x] log q[z|x].
E-step: q_{t+1} ← argmax_q A(q, θ_t).
M-step: θ_{t+1} ← argmax_θ A(q_{t+1}, θ).
Proof: the E-step follows from the two identities
A(q, θ) = Σ_z q[z|x] log (p_θ[x, z] / q[z|x]) ≤ log Σ_z q[z|x] (p_θ[x, z] / q[z|x]) = log Σ_z p_θ[x, z] = log p_θ[x] = L(θ, x) (by Jensen's inequality),
A(p_θ[z|x], θ) = Σ_z p_θ[z|x] log (p_θ[x, z] / p_θ[z|x]) = Σ_z p_θ[z|x] log p_θ[x] = log p_θ[x] = L(θ, x),
so q = p_{θ_t}[z|x] maximizes A(·, θ_t).

EM = Coordinate Ascent
Proof (cont.): finally, note that
argmax_θ A(q_{t+1}, θ) = argmax_θ Σ_z q_{t+1}[z|x] log p_θ[x, z],
which coincides with the M-step of EM.

EM - Extensions and Variants (Dempster et al., 1977; Jamshidian and Jennrich, 1993; Wu, 1983)
Generalized EM (GEM): full maximization is not necessary at every step, since any (strict) increase of the auxiliary function guarantees an increase of the log-likelihood.
Sparse EM: posterior probabilities computed only at some points in the E-step.

This Lecture
Expectation-Maximization (EM) algorithm
Hidden Markov Models

Motivation
Data: sample of m sequences over an alphabet Σ drawn i.i.d. according to some distribution D: x^i_1, x^i_2, ..., x^i_{k_i}, for i = 1, ..., m.
Problem: find a sequence model that best estimates the distribution D.
Latent variables: the observed sequences may have been generated by a model with states and transitions that are hidden or unobserved.

Example
Observations: sequences of Heads and Tails.
Observed sequence: H, T, T, H, T, H, T, H, H, H, T, T, ..., H.
Unobserved paths: (0, H, 0), (0, T, 1), (1, T, 1), ..., (2, H, 2).
Model: a three-state HMM over states {0, 1, 2} with weighted transitions such as H/.1, T/.4, H/.3, T/.7, H/.6, T/.5 (transition diagram in the original slide).

Hidden Markov Models
Definition: probabilistic automata, generative view.
Discrete case: finite alphabet Σ.
Simplification: for us, w[e] = Pr[a | e] Pr[e], with
transition probability Pr[transition e] and emission probability Pr[emission a | e].
Illustration: (weighted automaton with transitions labeled a/.3, a/.2, b/.1, c/.2 and final weight /.5 in the original slide).

Other Models
Outputs at states (instead of transitions): equivalent model, dual representation.
Non-discrete case: outputs in R^N;
fixed probabilities replaced by distributions, e.g., Gaussian mixtures;
application to acoustic modeling.

Three Problems (Rabiner, 1989)
Problem 1: given a sequence x_1, ..., x_k and a hidden Markov model p_θ, compute p_θ[x_1, ..., x_k].
Problem 2: given a sequence x_1, ..., x_k and a hidden Markov model p_θ, find the most likely path π = e_1, ..., e_r that generated that sequence.
Problem 3: estimate the parameters θ of a hidden Markov model via ML: θ* = argmax_θ p_θ[x].

P1: Evaluation
Algorithm: let X be a finite automaton representing the sequence x = x_1 ⋯ x_k and let H denote the current hidden Markov model. Then,
p_θ[x] = Σ_{π ∈ X ∘ H} w[π].
Thus, it can be computed using composition and a shortest-distance algorithm in time O(|X ∘ H|).
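A minimal Python sketch of this computation for an assumed toy two-state model (parameters and names are illustrative, not from the slides); the forward recursion below plays the role of the shortest-distance computation over X ∘ H in the probability semiring:

```python
import numpy as np

# toy two-state HMM over the alphabet {H, T} (illustrative parameters)
start = np.array([1.0, 0.0])            # initial state distribution
trans = np.array([[0.7, 0.3],           # trans[i, j] = Pr[next state j | state i]
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],            # emit[i, a] = Pr[symbol a | state i]
                 [0.2, 0.8]])
symbols = {"H": 0, "T": 1}

def forward_probability(x):
    """p_theta[x]: sum of the weights of all hidden paths generating x."""
    alpha = start * emit[:, symbols[x[0]]]
    for a in x[1:]:
        alpha = (alpha @ trans) * emit[:, symbols[a]]
    return alpha.sum()

print(forward_probability(["H", "T", "T", "H"]))
```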

P2: Most Likely Path
Algorithm: shortest-path problem in the (max, ×) semiring:
any shortest-path algorithm applied to the result of the composition X ∘ H. In particular, since the automaton is acyclic, with a linear-time shortest-path algorithm the total complexity is in O(|X ∘ H|) = O(|X| |H|).
A traditional solution is to use the Viterbi algorithm.
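A minimal Viterbi sketch for the same assumed toy model as above (illustrative parameters): a shortest-path computation in the (max, ×) semiring, with back-pointers to recover the best state sequence.

```python
import numpy as np

start = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
symbols = {"H": 0, "T": 1}

def viterbi(x):
    """Return (probability of the best path, best state sequence)."""
    delta = start * emit[:, symbols[x[0]]]   # best-path probability ending in each state
    backptr = []
    for a in x[1:]:
        scores = delta[:, None] * trans      # scores[i, j]: best path into i, then i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * emit[:, symbols[a]]
    state = int(delta.argmax())              # best final state
    path = [state]
    for bp in reversed(backptr):             # follow back-pointers
        state = int(bp[state])
        path.append(state)
    return delta.max(), path[::-1]

print(viterbi(["H", "T", "T", "H"]))
```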

P3: Parameter Estimation (Baum, 1972)
Baum-Welch algorithm: special case of EM applied to HMMs. Here, θ represents the transition weights w[e]; the latent or hidden variables are the paths π.
M-step: argmax_{θ′} Σ_π p_θ[π|x] log p_θ′[x, π],
subject to ∀q ∈ Q, Σ_{e ∈ E[q]} w_θ′[e] = 1.

P3: Parameter Estimation
Using Lagrange multipliers λ_q, q ∈ Q, the problem consists of setting the following partial derivatives to zero:
∂/∂w_θ′[e] [ Σ_π p_θ[π|x] log p_θ′[x, π] − Σ_q λ_q Σ_{e′ ∈ E[q]} w_θ′[e′] ] = 0.
Note that p_θ′[x, π] ≠ 0 only for paths π ∈ Π(x) labeled with x. In that case,
p_θ′[x, π] = ∏_e w_θ′[e]^{|π|_e},
where |π|_e is the number of occurrences of e in π.

P3: Parameter Estimation
Thus, the equation with partial derivatives can be rewritten as
Σ_{π ∈ Π(x)} p_θ[π|x] |π|_e / w_θ′[e] = λ_{orig(e)}
⟹ w_θ′[e] = (1 / λ_{orig(e)}) Σ_{π ∈ Π(x)} p_θ[π|x] |π|_e
⟹ w_θ′[e] = (1 / (λ_{orig(e)} p_θ[x])) Σ_{π ∈ Π(x)} p_θ[x, π] |π|_e.

P3: Parameter Estimation
But, Σ_{π ∈ Π(x)} p_θ[x, π] |π|_e = Σ_{i=1}^k α_{i−1}(θ) w_θ[e] β_i(θ), with
α_i(θ) = p_θ[x_1, ..., x_i, q = orig[e]] (forward probability),
β_i(θ) = p_θ[x_{i+1}, ..., x_k | dest[e]] (backward probability).
Expected number of times through transition e: it can be computed using forward-backward algorithms, or, for us, shortest-distance algorithms.
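A minimal single-sequence sketch of this re-estimation step, under an assumed toy model with emissions attached to states (the equivalent dual representation mentioned earlier); all parameters and names are illustrative:

```python
import numpy as np

start = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
symbols = {"H": 0, "T": 1}

def baum_welch_step(x):
    """One EM update of the transition matrix from a single observation sequence."""
    obs = [symbols[a] for a in x]
    k, n = len(obs), len(start)

    # forward: alpha[i, q] = p[obs[0..i], state q]
    alpha = np.zeros((k, n))
    alpha[0] = start * emit[:, obs[0]]
    for i in range(1, k):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]

    # backward: beta[i, q] = p[obs[i+1..] | state q]
    beta = np.zeros((k, n))
    beta[-1] = 1.0
    for i in range(k - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])

    # expected number of times through each transition (q, q'), divided by p[x]
    xi = np.zeros((n, n))
    for i in range(k - 1):
        xi += np.outer(alpha[i], emit[:, obs[i + 1]] * beta[i + 1]) * trans
    xi /= alpha[-1].sum()

    # M-step: renormalize the expected counts per origin state
    return xi / xi.sum(axis=1, keepdims=True)

print(baum_welch_step(["H", "T", "T", "H", "H"]))
```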

Continuous Models (Rabiner and Juang, 1993)
Graph topology: a 3-state HMM model for each CD phone in speech, e.g., a left-to-right topology 0 →(d0:ε) 1 →(d1:ε) 2 →(d2:ae_{b,d}) 3 with self-loops at each state (diagram in the original slide).
Interpretation: beginning, middle, and end of the CD phone.
Continuous case: transition weights based on distributions over feature vectors in R^N, typically with N = 39.

Distributions
Simple cases: e.g., single speaker, single Gaussian distribution
N(x; µ, σ) = 1 / ((2π)^{N/2} |σ|^{1/2}) exp(−(1/2)(x − µ)^⊤ σ^{−1} (x − µ)),
with the covariance matrix σ typically diagonal.
General case: mixtures of Gaussians
Σ_{k=1}^M λ_k N(x; µ_k, σ_k), with λ_k ≥ 0 and Σ_{k=1}^M λ_k = 1. Typically, M = 16.
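A minimal sketch of evaluating such a diagonal-covariance Gaussian mixture density at a feature vector (dimensions, weights, and parameter values are illustrative assumptions):

```python
import numpy as np

def gaussian_density(x, mu, var):
    """N(x; mu, sigma) for a diagonal covariance given by the variance vector var."""
    norm = (2.0 * np.pi) ** (len(x) / 2.0) * np.sqrt(np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def gmm_density(x, weights, means, variances):
    """sum_k lambda_k N(x; mu_k, sigma_k)."""
    return sum(w * gaussian_density(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

x = np.array([0.3, -1.2, 0.5])
weights = [0.5, 0.5]
means = [np.zeros(3), np.array([1.0, -1.0, 0.0])]
variances = [np.ones(3), np.array([0.5, 2.0, 1.0])]
print(gmm_density(x, weights, means, variances))
```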

GMMs - Illustration (figure in the original slides)

Estimation Algorithm
Baum-Welch algorithm: maximum likelihood of the joint distribution Pr[o, p]:
requires a segmentation to match feature vectors with HMM transitions;
generalizes to the continuous case with Gaussians and GMMs.
Questions: segmentation, model initialization.

Single Gaussian Distribution
Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
3.18, 2.35, .95, 1.175, ...
Normal distribution: p(x) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²)).
Likelihood: l(p) = −(m/2) log(2πσ²) − Σ_{i=1}^m (x_i − µ)²/(2σ²).
Solution: l is differentiable and concave;
∂l/∂µ = 0 ⟹ µ = (1/m) Σ_{i=1}^m x_i,
∂l/∂σ² = 0 ⟹ σ² = (1/m) Σ_{i=1}^m x_i² − µ².
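A quick numerical check of these closed-form estimates in Python (the observation values are the illustrative ones from the slide):

```python
import numpy as np

x = np.array([3.18, 2.35, 0.95, 1.175])

mu_hat = x.mean()                            # (1/m) sum_i x_i
sigma2_hat = np.mean(x ** 2) - mu_hat ** 2   # (1/m) sum_i x_i^2 - mu^2

print(mu_hat, sigma2_hat)
print(np.var(x))                             # same value: the (biased) ML variance
```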

Differentials
General identities:
log of determinant: ∂ log |σ^{−1}| / ∂σ^{−1} = σ;
bilinear form: ∂ (x^⊤ σ x) / ∂σ = x x^⊤.

Single Gaussian - ML Solution
Log-likelihood: for each x_i of the sample x_1, ..., x_m,
Pr[x_i] = 1 / ((2π)^{N/2} |σ|^{1/2}) exp(−(1/2)(x_i − µ)^⊤ σ^{−1} (x_i − µ)),
L = Σ_{i=1}^m log Pr[x_i] = Σ_{i=1}^m [ −(N/2) log(2π) + (1/2) log |σ^{−1}| − (1/2)(x_i − µ)^⊤ σ^{−1} (x_i − µ) ].
ML solution:
∂L/∂µ = Σ_{i=1}^m σ^{−1}(x_i − µ) = 0 ⟹ µ = (1/m) Σ_{i=1}^m x_i,
∂L/∂σ^{−1} = Σ_{i=1}^m (1/2)(σ − (x_i − µ)(x_i − µ)^⊤) = 0 ⟹ σ = (1/m) Σ_{i=1}^m (x_i − µ)(x_i − µ)^⊤.

GMMs - EM Algorithm
Mixture of M Gaussians: p_θ[x] = Σ_{k=1}^M λ_k N(x; µ_k, σ_k), with log-likelihood
L = Σ_{i=1}^m log Σ_{k=1}^M λ_k N(x_i; µ_k, σ_k).
EM algorithm: let p^t_{i,k} = N(x_i; µ^t_k, σ^t_k).
E-step: q^{t+1}_{i,k} = p_{θ^t}[z = k | x_i] = λ^t_k p^t_{i,k} / Σ_{k′=1}^M λ^t_{k′} p^t_{i,k′}.
M-step:
µ^{t+1}_k = Σ_{i=1}^m q^{t+1}_{i,k} x_i / Σ_{i=1}^m q^{t+1}_{i,k},
σ^{t+1}_k = Σ_{i=1}^m q^{t+1}_{i,k} (x_i − µ^{t+1}_k)(x_i − µ^{t+1}_k)^⊤ / Σ_{i=1}^m q^{t+1}_{i,k},
λ^{t+1}_k = (1/m) Σ_{i=1}^m q^{t+1}_{i,k}.
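A minimal one-dimensional sketch of these updates in Python (toy data, initialization, and iteration count are illustrative assumptions; no safeguards against degenerate components):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 1.0, 300)])
m, M = len(x), 2

lam = np.full(M, 1.0 / M)       # mixing weights lambda_k
mu = rng.choice(x, size=M)      # component means mu_k
var = np.full(M, x.var())       # component variances sigma_k^2

for _ in range(100):
    # E-step: responsibilities q_{i,k} proportional to lambda_k N(x_i; mu_k, var_k)
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    q = lam * dens
    q /= q.sum(axis=1, keepdims=True)

    # M-step: weighted means, variances, and mixing weights
    nk = q.sum(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    lam = nk / m

print(lam, mu, var)
```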

GMMs - EM Algorithm
Proof (M-step): auxiliary function
l = Σ_{k=1}^M Σ_{i=1}^m p_θ[z = k | x_i] log p_θ′[x_i, z = k] = Σ_{k=1}^M Σ_{i=1}^m q_{i,k} log p_θ′[x_i, z = k]
= Σ_{k=1}^M Σ_{i=1}^m q_{i,k} [ log λ_k − (N/2) log(2π) + (1/2) log |σ_k^{−1}| − (1/2)(x_i − µ_k)^⊤ σ_k^{−1} (x_i − µ_k) ].
Optimization for fixed q, with the constraint Σ_{k=1}^M λ_k = 1:
∂l/∂µ_k = Σ_{i=1}^m q_{i,k} σ_k^{−1}(x_i − µ_k) = 0,
∂l/∂σ_k^{−1} = (1/2) Σ_{i=1}^m q_{i,k} (σ_k − (x_i − µ_k)(x_i − µ_k)^⊤) = 0,
∂l/∂λ_k = (1/λ_k) Σ_{i=1}^m q_{i,k} − β = 0 (β: Lagrange multiplier).

References
L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1 (1977), pp. 1-38.
M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.
Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1 (Mar., 1983), pp. 95-103.