EM algorithm and applications Lecture #9

EM algorithm and applications (Lecture #9). Background readings: Chapters 11.2 and 11.6 of the textbook Biological Sequence Analysis, Durbin et al., 2001.

The EM algorithm. Plan for this lecture: 1. Presentation and correctness proof of the EM algorithm. 2. Examples of implementations.

Model, Parameters, ML. A model with parameters θ is a probabilistic space M, in which each simple event y is determined by values of random variables (dice). The parameters θ are the probabilities associated with these random variables. (In an HMM of length L, the simple events are HMM sequences of length L, and the parameters are the transition probabilities m_{kl} and the emission probabilities e_k(b).) An observed datum is a non-empty subset x ⊆ M. (In an HMM, it can be the set of all simple events consistent with a given output sequence.) Given observed data x, the ML method seeks parameters θ* which maximize the likelihood of the data, p(x|θ) = Σ_y p(x,y|θ). Finding such θ* is easy when the observed data is a simple event, but hard in general.

The EM algorithm. Assume a model with parameters as in the previous slide. Given observed data x, the likelihood of x under model parameters θ is given by p(x|θ) = Σ_y p(x,y|θ). (The y's are the simple events which comprise x, usually determined by the possible values of the hidden data.) The EM algorithm receives x and parameters θ, and returns new parameters λ* such that p(x|λ*) > p(x|θ), i.e., the new parameters increase the likelihood of the observed data.

The EM algorithm. (The original slide plots the two log-likelihood functions log L_θ(λ) = E_θ[log P(x,y|λ)] and log P(x|λ) as functions of λ.) EM uses the current parameters θ to construct a simpler ML problem L_θ:
L_θ(λ) = Π_y p(x,y|λ)^{p(y|x,θ)}.
Guarantee: if L_θ(λ) > L_θ(θ), then P(x|λ) > P(x|θ).

Derivation of the EM Algorithm. Let x be the observed data, and let {(x,y_1),…,(x,y_k)} be the set of (simple) events which comprise x. Our goal is to find parameters θ* which maximize the sum
p(x|θ*) = p(x,y_1|θ*) + p(x,y_2|θ*) + … + p(x,y_k|θ*).
As this is hard, we start with some parameters θ, and only find λ* such that
p(x|λ*) = Σ_{i=1}^k p(x,y_i|λ*) > Σ_{i=1}^k p(x,y_i|θ) = p(x|θ).

For given parameters θ, let p_i = p(y_i|x,θ) (note that p_1 + … + p_k = 1). We use the p_i's to define a virtual sampling, in which y_1 occurs p_1 times, y_2 occurs p_2 times, …, y_k occurs p_k times. The EM algorithm looks for new parameters λ which maximize the likelihood of this "virtual" sampling. This likelihood is given by
L_θ(λ) = p(y_1,x|λ)^{p_1} · p(y_2,x|λ)^{p_2} ⋯ p(y_k,x|λ)^{p_k}.

The EM algorithm. In each iteration the EM algorithm does the following.
(E step): Calculate L_θ(λ) = Π_y p(y,x|λ)^{p(y|x,θ)}.
(M step): Find λ* which maximizes L_θ(λ). (The next iteration sets θ ← λ* and repeats.)
Comments: 1. At the M step we only need that L_θ(λ*) > L_θ(θ). This relaxation yields the so-called Generalized EM algorithm, which is important when it is hard to find the optimal λ*. 2. Usually, Q_θ(λ) = log L_θ(λ) = Σ_y p(y|x,θ) log p(y,x|λ) is used instead.
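In code, one EM iteration for a model whose hidden completions y of x can be enumerated explicitly looks as follows. This is a minimal illustrative sketch, not part of the lecture notes; the callables hidden_completions, joint_prob and maximize_weighted_loglik are hypothetical placeholders that the caller must supply.

```python
def em_iteration(x, theta, hidden_completions, joint_prob, maximize_weighted_loglik):
    """One EM step: return new parameters lambda* with L_theta(lambda*) maximized.

    hidden_completions(x)            -> iterable of hidden events y comprising x
    joint_prob(x, y, theta)          -> p(x, y | theta)
    maximize_weighted_loglik(x, ys, w) -> argmax over lambda of
                                          sum_i w[i] * log p(x, ys[i] | lambda)
    """
    ys = list(hidden_completions(x))

    # E step: posterior weights p(y | x, theta) = p(x, y | theta) / p(x | theta)
    joints = [joint_prob(x, y, theta) for y in ys]
    p_x = sum(joints)
    weights = [j / p_x for j in joints]

    # M step: maximize Q_theta(lambda) = sum_y p(y|x,theta) * log p(x, y | lambda)
    return maximize_weighted_loglik(x, ys, weights)
```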

Correctness Theorem for the EM Algorithm. Theorem: Let {(x,y_1),…,(x,y_k)} be a collection of events, as in the setting of the EM algorithm, and let
L_θ(λ) = Π_{i=1}^k prob(x,y_i|λ)^{prob(y_i|x,θ)}.
Then the following holds: if L_θ(λ*) > L_θ(θ), then prob(x|λ*) > prob(x|θ).

Correctness proof of EM. Let prob(y_i|x,θ) = p_i and prob(y_i|x,λ*) = q_i. Then from the definition of conditional probability we have:
prob(x,y_i|θ) = p_i · prob(x|θ), prob(x,y_i|λ*) = q_i · prob(x|λ*).
By the EM assumption on θ and λ*:
Π_{i=1}^k (q_i · prob(x|λ*))^{p_i} = L_θ(λ*) > L_θ(θ) = Π_{i=1}^k (p_i · prob(x|θ))^{p_i}.
Since Σ_{i=1}^k p_i = Σ_{i=1}^k q_i = 1, we get:
(Π_{i=1}^k q_i^{p_i}) · prob(x|λ*) > (Π_{i=1}^k p_i^{p_i}) · prob(x|θ).

Correctness proof of EM (end). From the last slide:
(Π_{i=1}^k q_i^{p_i}) · prob(x|λ*) > (Π_{i=1}^k p_i^{p_i}) · prob(x|θ).  [1]
By the ML principle (the relative frequencies p_i maximize Π_i z_i^{p_i} over probability vectors z), we have:
Π_{i=1}^k q_i^{p_i} ≤ Π_{i=1}^k p_i^{p_i}, i.e., Π_{i=1}^k (q_i/p_i)^{p_i} ≤ 1.
Dividing inequality [1] by Π_{i=1}^k p_i^{p_i}, we get:
prob(x|λ*) ≥ (Π_{i=1}^k (q_i/p_i)^{p_i}) · prob(x|λ*) > prob(x|θ). QED

Example: Baum-Welch = EM for HMM. The Baum-Welch algorithm is the EM algorithm for HMMs.
E step for HMM: L_θ(λ) = Π_s p(s,x|λ)^{p(s|x,θ)}, where λ are the new parameters {m_{kl}, e_k(b)} and s ranges over state sequences.
M step for HMM: look for λ which maximizes L_θ(λ).
Recall that for an HMM, p(s,x|λ) = Π_{k,l} m_{kl}^{M_{kl}(s)} · Π_{k,b} e_k(b)^{E_k(b,s)}, where M_{kl}(s) is the number of k→l transitions in s and E_k(b,s) is the number of times state k emits symbol b in (s,x).

Baum-Welch = EM for HMM (cont). Writing p(s,x|λ) as Π_{k,l} m_{kl}^{M_{kl}(s)} · Π_{k,b} e_k(b)^{E_k(b,s)}, we get
L_θ(λ) = Π_s [ Π_{k,l} m_{kl}^{M_{kl}(s)} · Π_{k,b} e_k(b)^{E_k(b,s)} ]^{p(s|x,θ)} = Π_{k,l} m_{kl}^{M_{kl}} · Π_{k,b} e_k(b)^{E_k(b)},
where M_{kl} = Σ_s p(s|x,θ) M_{kl}(s) and E_k(b) = Σ_s p(s|x,θ) E_k(b,s).
As we showed, L_θ(λ) is maximized when the m_{kl}'s and e_k(b)'s are the relative frequencies of the corresponding variables given x and θ, i.e.,
m_{kl} = M_{kl} / Σ_{l'} M_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b').
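As a concrete illustration, this M step is just a per-state normalization of the expected counts. A minimal sketch (the expected counts M_{kl} and E_k(b) would in practice come from the forward-backward algorithm, which is not shown here; the function name is illustrative):

```python
def baum_welch_m_step(M, E):
    """Normalize expected counts into new HMM parameters.

    M[k][l] : expected number of k -> l transitions, given x and theta
    E[k][b] : expected number of emissions of symbol b from state k
    Returns (m, e): new transition and emission probabilities.
    """
    m = {k: {l: M[k][l] / sum(M[k].values()) for l in M[k]} for k in M}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return m, e
```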

A simple example: EM for 2 coin tosses. Consider the following experiment: we are given a coin with two possible outcomes, H (head) and T (tail), with probabilities θ_H and θ_T = 1 − θ_H. The coin is tossed twice, but only the first outcome, T, is seen. So the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (θ_H, θ_T) = (¼, ¾).

EM for 2 coin tosses (cont). The hidden data which produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (θ_H, θ_T) is p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = θ_T θ_H + θ_T². For the initial parameters θ = (¼, ¾), we have: p(x|θ) = ¾·¼ + ¾·¾ = ¾. Note that in this case P(x,y_i|θ) = P(y_i|θ), for i = 1,2: we can always define y so that (x,y) = y (otherwise we set y' ← (x,y) and replace the y's by the y''s).

EM for 2 coin tosses - E step. Calculate L_θ(λ) = L_θ(λ_H, λ_T). Recall: λ_H, λ_T are the new parameters, which we need to optimize:
L_θ(λ) = p(x,y_1|λ)^{p(y_1|x,θ)} · p(x,y_2|λ)^{p(y_2|x,θ)}.
p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼
p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾
Thus we have L_θ(λ) = p(x,y_1|λ)^{1/4} · p(x,y_2|λ)^{3/4}.

EM for 2 coin tosses - E step. For a sequence y of coin tosses, let N_H(y) be the number of H's in y, and N_T(y) the number of T's in y. Then
p(y|λ) = λ_H^{N_H(y)} · λ_T^{N_T(y)}.
In our example: y_1 = (T,H), y_2 = (T,T), hence N_H(y_1) = N_T(y_1) = 1, N_H(y_2) = 0, N_T(y_2) = 2.

Example: 2 coin tosses - E step. Thus
p(x,y_1|λ) = λ_T^{N_T(y_1)} λ_H^{N_H(y_1)} = λ_T λ_H,   p(x,y_2|λ) = λ_T^{N_T(y_2)} λ_H^{N_H(y_2)} = λ_T².
Therefore
L_θ(λ) = p(x,y_1|λ)^{1/4} · p(x,y_2|λ)^{3/4} = (λ_T λ_H)^{1/4} · (λ_T²)^{3/4} = λ_T^{¼N_T(y_1)+¾N_T(y_2)} · λ_H^{¼N_H(y_1)+¾N_H(y_2)} = λ_T^{7/4} · λ_H^{1/4}.
And in general: L_θ(λ) = λ_T^{N_T} · λ_H^{N_H}, with N_T = 7/4 and N_H = ¼.

EM for 2 coin tosses - M step. Find λ* which maximizes L_θ(λ). As we already saw, λ_H^{N_H} λ_T^{N_T} is maximized when
λ_H = N_H/(N_H + N_T), λ_T = N_T/(N_H + N_T).
Here λ_H = (¼)/(¼ + 7/4) = 1/8 and λ_T = (7/4)/(¼ + 7/4) = 7/8,
that is, λ* = (1/8, 7/8), and p(x|λ*) = (7/8)(1/8) + (7/8)² = 7/8. [The optimal parameters (0,1) will never be reached by the EM algorithm!]
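The numbers in this example are easy to verify by computer. A minimal sketch of the single EM iteration above, using the slide's notation; it is specific to the observed data x = (T,*) and the initial parameters (¼, ¾):

```python
# EM iteration for the two-coin-toss example: x = (T, *), theta = (1/4, 3/4).
theta_H, theta_T = 0.25, 0.75

# Hidden completions of x and their joint probabilities p(x, y | theta):
# y1 = (T, H), y2 = (T, T)
p_xy1 = theta_T * theta_H          # 3/16
p_xy2 = theta_T * theta_T          # 9/16
p_x = p_xy1 + p_xy2                # 3/4

# E step: posterior weights and expected head/tail counts.
w1, w2 = p_xy1 / p_x, p_xy2 / p_x  # 1/4, 3/4
N_H = w1 * 1 + w2 * 0              # 1/4
N_T = w1 * 1 + w2 * 2              # 7/4

# M step: relative frequencies.
lam_H = N_H / (N_H + N_T)          # 1/8
lam_T = N_T / (N_H + N_T)          # 7/8
print(lam_H, lam_T)                # 0.125 0.875

# The likelihood increases: p(x | lambda*) = 7/8 > 3/4 = p(x | theta).
print(lam_T * lam_H + lam_T**2)    # 0.875
```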

EM for a single random variable (die). Now, the probability of each y (= (x,y)) is given by a sequence of die tosses. The die has m outcomes, with probabilities λ_1,…,λ_m. Let N_k(y) = #(times outcome k occurs in y). Then
p(y|λ) = Π_{k=1}^m λ_k^{N_k(y)}.
Let N_k be the expected value of N_k(y), given x and θ:
N_k = E(N_k|x,θ) = Σ_y p(y|x,θ) N_k(y).
Then we have:

L_θ(λ) for one die.
L_θ(λ) = Π_y p(y|λ)^{p(y|x,θ)} = Π_y Π_{k=1}^m λ_k^{N_k(y)·p(y|x,θ)} = Π_{k=1}^m λ_k^{Σ_y p(y|x,θ)N_k(y)} = Π_{k=1}^m λ_k^{N_k},
which is maximized for λ_k = N_k / Σ_{k'} N_{k'}.

EM algorithm for n independent observations x_1,…,x_n: Expectation step. It can be shown that, if the x_j are independent, then:
N_k = Σ_{j=1}^n N_k^j = Σ_{j=1}^n Σ_{y^j} p(y^j|x^j,θ) N_k(y^j,x^j) = Σ_{j=1}^n (1/p(x^j|θ)) Σ_{y^j} p(y^j,x^j|θ) N_k(y^j,x^j).

Example: The ABO locus. A locus is a particular place on the chromosome. Each locus state (called a genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguishing features. The ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in a population. Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE is given by:
q_{a/a} = N_{a/a}/N, q_{a/o} = N_{a/o}/N, q_{b/b} = N_{b/b}/N, q_{b/o} = N_{b/o}/N, q_{a/b} = N_{a/b}/N, q_{o/o} = N_{o/o}/N.

The ABO locus (cont.). However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?

The ABO locus (cont.). The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2q_a q_b, q_{a/o} = 2q_a q_o, q_{b/o} = 2q_b q_o, q_{a/a} = (q_a)², q_{b/b} = (q_b)², q_{o/o} = (q_o)². In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as observed data x with hidden data y:

The ABO locus (cont.). The die outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, i.e., by an ordered genotype pair; this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a). So we have three parameters of one die, q_a, q_b, q_o, that we need to estimate. We start with parameters θ = (q_a, q_b, q_o), and then use EM to improve them.

EM setting for the ABO locus. The observed data x = (x_1,…,x_n) is a sequence of elements (blood types) from the set {A,B,AB,O}. E.g., (B,A,B,B,O,A,B,A,O,B,AB) are observations (x_1,…,x_11). The hidden data (i.e., the y's) for each x_j is the set of ordered pairs of alleles that generates it. For instance, for A it is the set {aa, ao, oa}. The parameters θ = {q_a, q_b, q_o} are the probabilities of the alleles.

EM for the ABO locus. For each observed blood type x^j ∈ {A,B,AB,O} and for each allele z in {a,b,o} we compute N_z(x^j), the expected number of times that z appears in x^j:
N_z(x^j) = Σ_{y^j} p(y^j|x^j,θ) N_z(y^j),
where the sum is taken over the ordered genotype pairs y^j, and N_z(y^j) is the number of times allele z occurs in the pair y^j. E.g., N_a(o,b) = 0; N_b(o,b) = N_o(o,b) = 1.

EM for the ABO locus. The computation for blood type B: P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2q_b q_o. Since N_b((b,b)) = 2 and N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1, the expected numbers of occurrences of o and b in B, N_o(B) and N_b(B), are given by:
N_o(B) = Σ_y p(y|B,θ) N_o(y) = 2q_b q_o / (q_b² + 2q_b q_o),
N_b(B) = Σ_y p(y|B,θ) N_b(y) = (2q_b² + 2q_b q_o) / (q_b² + 2q_b q_o).
Observe that N_b(B) + N_o(B) = 2.

EM for the ABO locus. Similarly, P(A|θ) = q_a² + 2q_a q_o, and
N_o(A) = 2q_a q_o / (q_a² + 2q_a q_o),  N_a(A) = (2q_a² + 2q_a q_o) / (q_a² + 2q_a q_o).
P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2q_a q_b; N_a(AB) = N_b(AB) = 1.
P(O|θ) = P((o,o)|θ) = q_o²; N_o(O) = 2.
[N_b(O) = N_a(O) = N_o(AB) = N_b(A) = N_a(B) = 0.]

E step: compute N_a, N_b and N_o. Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the numbers of observations of A, B, AB, and O respectively (so N = 11). Then
N_a = #(A)·N_a(A) + #(AB)·N_a(AB)
N_b = #(B)·N_b(B) + #(AB)·N_b(AB)
N_o = #(A)·N_o(A) + #(B)·N_o(B) + #(O)·N_o(O)
Note that N_a + N_b + N_o = 2N = 22.
M step: set λ* = (q_a*, q_b*, q_o*), where
q_a* = N_a/(2N); q_b* = N_b/(2N); q_o* = N_o/(2N).
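Putting the E and M steps together for the ABO example gives a short iterative procedure. A minimal sketch, assuming the observation counts above (#(A)=3, #(B)=5, #(AB)=1, #(O)=2) and uniform starting allele frequencies; the function name abo_em_step is illustrative:

```python
def abo_em_step(qa, qb, qo, nA=3, nB=5, nAB=1, nO=2):
    """One EM iteration for ABO allele frequencies under Hardy-Weinberg."""
    # E step: expected allele counts per observed blood type.
    pA = qa**2 + 2 * qa * qo
    pB = qb**2 + 2 * qb * qo
    Na_A = (2 * qa**2 + 2 * qa * qo) / pA   # expected a's in a type-A individual
    No_A = (2 * qa * qo) / pA
    Nb_B = (2 * qb**2 + 2 * qb * qo) / pB
    No_B = (2 * qb * qo) / pB
    Na = nA * Na_A + nAB * 1                 # N_a
    Nb = nB * Nb_B + nAB * 1                 # N_b
    No = nA * No_A + nB * No_B + nO * 2      # N_o
    # M step: normalize by the total number of sampled alleles, 2N.
    total = Na + Nb + No                     # equals 2 * (nA + nB + nAB + nO) = 22
    return Na / total, Nb / total, No / total

q = (1/3, 1/3, 1/3)
for _ in range(20):
    q = abo_em_step(*q)
print(q)   # converged estimates of (q_a, q_b, q_o)
```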

EM for a general discrete stochastic process. Now we wish to maximize the likelihood of an observation x with hidden data as before, i.e., maximize p(x|λ) = Σ_y p(x,y|λ). But this time the experiment (x,y) is generated by a general stochastic process. The only assumption we make is that the outcome of each experiment consists of a (finite) sequence of samplings of r discrete random variables (dice) Z_1,…,Z_r, where each Z_i may be sampled several times. This can be realized by a probabilistic acyclic state machine, where at each state some Z_i is sampled and the next state is determined by the outcome, until a final state is reached.

EM for processes with many dice. Example: in an HMM, the random variables are the transition and emission probabilities a_{kl}, e_k(b). x is the visible information, y is the sequence s of states, and (x,y) is the complete HMM run. As before, we can redefine y so that (x,y) = y. (The original slide shows the usual HMM diagram: states s_1, s_2, …, s_{L-1}, s_L emitting symbols x_1, x_2, …, x_{L-1}, x_L.)

EM for processes with many dice. Each random variable Z_l (l = 1,…,r) has m_l values z_{l,1},…,z_{l,m_l}, with probabilities {q_{lk} : k = 1,…,m_l}. Each y defines a sequence of outcomes (z_{l_1,k_1},…,z_{l_n,k_n}) of the random variables used in y. In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence y^j. Let N_{lk}(y) = #(times z_{l,k} appears in y).

EM for processes with many dice. Similarly to the single-die case, we have:
p(y|λ) = Π_{l=1}^r Π_{k=1}^{m_l} λ_{lk}^{N_{lk}(y)}.
Define N_{lk} as the expected value of N_{lk}(y), given x and θ:
N_{lk} = E(N_{lk}|x,θ) = Σ_y p(y|x,θ) N_{lk}(y).
Then we have:

L_θ(λ) and Q_θ(λ) for processes with many dice.
L_θ(λ) = Π_y p(y|λ)^{p(y|x,θ)} = Π_y Π_{l=1}^r Π_{k=1}^{m_l} λ_{lk}^{N_{lk}(y)·p(y|x,θ)} = Π_{l=1}^r Π_{k=1}^{m_l} λ_{lk}^{N_{lk}},
where N_{lk} = Σ_y p(y|x,θ) N_{lk}(y) is the expected number of times that, given x and θ, the outcome of die l was k.
L_θ(λ) is maximized for λ_{lk} = N_{lk} / Σ_{k'} N_{lk'}.

EM algorithm for processes with many dice. Similarly to the one-die case we get:
Expectation step: set N_{lk} to E(N_{lk}(y)|x,θ), i.e., N_{lk} = Σ_y p(y|x,θ) N_{lk}(y).
Maximization step: set λ_{lk} = N_{lk} / Σ_{k'} N_{lk'}.

EM algorithm for n independent observations x_1,…,x_n: Expectation step. It can be shown that, if the x_j are independent, then:
N_{lk} = Σ_{j=1}^n N_{lk}^j = Σ_{j=1}^n Σ_{y^j} p(y^j|x^j,θ) N_{lk}(y^j,x^j) = Σ_{j=1}^n (1/p(x^j|θ)) Σ_{y^j} p(y^j,x^j|θ) N_{lk}(y^j,x^j).

EM in Practice.
Initial parameters: random parameter setting, or a best guess from another source.
Stopping criteria: small change in the likelihood of the data, or small change in the parameter values.
Avoiding bad local maxima: multiple restarts, and early pruning of unpromising runs.
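These practical points can be wrapped into a small driver around any EM step. A minimal sketch, assuming user-supplied em_step (one EM iteration returning new parameters), loglik and random_init callables, all of which are hypothetical placeholders:

```python
def run_em(em_step, loglik, init, tol=1e-6, max_iter=500):
    """Iterate EM until the increase in log-likelihood falls below tol."""
    theta, ll = init, loglik(init)
    for _ in range(max_iter):
        theta = em_step(theta)
        new_ll = loglik(theta)
        if new_ll - ll < tol:   # stopping criterion: small change in likelihood
            ll = new_ll
            break
        ll = new_ll
    return theta, ll

def em_with_restarts(em_step, loglik, random_init, n_restarts=10):
    """Multiple random restarts; keep the run with the best final likelihood."""
    runs = (run_em(em_step, loglik, random_init()) for _ in range(n_restarts))
    return max(runs, key=lambda pair: pair[1])
```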