EM algorithm and applications Lecture #9
- Shawn Lee
- 5 years ago
1 EM algorithm and applications, Lecture #9. Background readings: Chapters 11.2, 11.6 in the textbook Biological Sequence Analysis, Durbin et al.
2 The EM algorithm. This lecture's plan: 1. Presentation and correctness proof of the EM algorithm. 2. Examples of implementations.
3 Model, Parameters, ML. A model with parameters θ is a probabilistic space M, in which each simple event y is determined by values of random variables (dice). The parameters θ are the probabilities associated with the random variables. (In an HMM of length L, the simple events are HMM-sequences of length L, and the parameters are the transition probabilities m_kl and the emission probabilities e_k(b).) An observed datum is a non-empty subset x ⊆ M. (In an HMM, it can be all the simple events which fit a given output sequence.) Given observed data x, the ML method seeks parameters θ* which maximize the likelihood of the data, p(x|θ) = Σ_y p(x,y|θ). Finding such θ* is easy when the observed data is a simple event, but hard in general.
4 The EM algorithm. Assume a model with parameters as in the previous slide. Given observed data x, the likelihood of x under model parameters θ is given by p(x|θ) = Σ_y p(x,y|θ). (The y's are the simple events which comprise x, usually determined by the possible values of the hidden data.) The EM algorithm receives x and parameters θ, and returns new parameters λ* such that p(x|λ*) > p(x|θ), i.e., the new parameters increase the likelihood of the observed data.
5 The EM algorithm. (Figure: graphs of the logarithms of the likelihood functions, log L_θ(λ) = E_θ[log P(x,y|λ)] and log P(x|λ), as functions of the parameters, with θ and λ marked.) EM uses the current parameters θ to construct a simpler ML problem L_θ:
L_θ(λ) = ∏_y p(x,y|λ)^{p(y|x,θ)}
Guarantee: if L_θ(λ) > L_θ(θ), then P(x|λ) > P(x|θ).
6 Derivation of the EM Algorithm. Let x be the observed data. Let {(x,y_1), ..., (x,y_k)} be the set of (simple) events which comprise x. Our goal is to find parameters θ* which maximize the sum
p(x|θ*) = p(x,y_1|θ*) + p(x,y_2|θ*) + ... + p(x,y_k|θ*)
As this is hard, we start with some parameters θ, and only find λ* such that:
p(x|λ*) = Σ_{i=1}^k p(x,y_i|λ*) > Σ_{i=1}^k p(x,y_i|θ) = p(x|θ)
7 For given parameters θ, let p_i = p(y_i|x,θ) (note that p_1 + ... + p_k = 1). We use the p_i's to define a virtual sampling, in which y_1 occurs p_1 times, y_2 occurs p_2 times, ..., y_k occurs p_k times. The EM algorithm looks for new parameters λ which maximize the likelihood of this "virtual" sampling. This likelihood is given by
L_θ(λ) = p(y_1,x|λ)^{p_1} · p(y_2,x|λ)^{p_2} ··· p(y_k,x|λ)^{p_k}
8 The EM algorithm. In each iteration the EM algorithm does the following.
(E-step): Calculate L_θ(λ) = ∏_y p(y,x|λ)^{p(y|x,θ)}
(M-step): Find λ* which maximizes L_θ(λ). (The next iteration sets θ ← λ* and repeats.)
Comments: 1. At the M-step we only need that L_θ(λ*) > L_θ(θ). This change yields the so-called Generalized EM algorithm. It is important when it is hard to find the optimal λ*. 2. Usually, Q_θ(λ) = log L_θ(λ) = Σ_y p(y|x,θ) log p(y,x|λ) is used.
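The E- and M-steps above can be sketched numerically. Below is a minimal Python illustration of one EM iteration (our own sketch, not from the slides), assuming the hidden completions of x can be enumerated explicitly and that the model is a single biased die with outcome probabilities λ_1, ..., λ_m, so that the M-step reduces to normalizing expected counts:

```python
# Sketch of one E/M iteration for a model whose hidden completions
# y_1..y_k of the observed data x can be listed, and whose events are
# sequences of rolls of a single biased die (so, as shown later in the
# lecture, the M-step is a normalization of expected outcome counts).
from collections import Counter

def p_joint(y, lam):
    # p(x, y | lambda): product of outcome probabilities along y
    # (here y already encodes x, i.e. (x, y) = y as in the slides)
    prob = 1.0
    for outcome in y:
        prob *= lam[outcome]
    return prob

def em_iteration(ys, lam):
    # E-step: posterior weights p_i = p(y_i | x, theta)
    px = sum(p_joint(y, lam) for y in ys)
    weights = [p_joint(y, lam) / px for y in ys]
    # expected counts N_k = sum_i p_i * N_k(y_i)
    counts = Counter()
    for y, w in zip(ys, weights):
        for outcome in y:
            counts[outcome] += w
    # M-step: lambda_k = N_k / sum_k' N_k'
    total = sum(counts.values())
    return {k: counts[k] / total for k in lam}

# Toy data: a 3-sided die rolled twice, first roll observed as 0,
# second roll hidden, so x = (0, *) and the completions are:
ys = [(0, 0), (0, 1), (0, 2)]
theta = {0: 0.2, 1: 0.3, 2: 0.5}
lam = em_iteration(ys, theta)
# Guarantee from the slides: the observed-data likelihood increases
assert sum(p_joint(y, lam) for y in ys) >= sum(p_joint(y, theta) for y in ys)
```

Here the likelihood of x rises from 0.2 to 0.6 in one iteration; the names `p_joint` and `em_iteration` are our own.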
9 Correctness Theorem for the EM Algorithm. Theorem: Let x = {(x,y_1), ..., (x,y_k)} be a collection of events, as in the setting of the EM algorithm, and let
L_θ(λ) = ∏_{i=1}^k p(x,y_i|λ)^{p(y_i|x,θ)}
Then the following holds: if L_θ(λ*) > L_θ(θ), then p(x|λ*) > p(x|θ).
10 Correctness proof of EM. Let p(y_i|x,θ) = p_i and p(y_i|x,λ*) = q_i. Then from the definition of conditional probability we have:
p(x,y_i|θ) = p_i · p(x|θ),  p(x,y_i|λ*) = q_i · p(x|λ*)
By the EM assumption on θ and λ*:
∏_{i=1}^k (q_i · p(x|λ*))^{p_i} = L_θ(λ*) > L_θ(θ) = ∏_{i=1}^k (p_i · p(x|θ))^{p_i}
Since Σ_{i=1}^k p_i = Σ_{i=1}^k q_i = 1, we get:
∏_{i=1}^k q_i^{p_i} · p(x|λ*) > ∏_{i=1}^k p_i^{p_i} · p(x|θ)   [1]
11 Correctness proof of EM (end). From the last slide:
∏_{i=1}^k q_i^{p_i} · p(x|λ*) > ∏_{i=1}^k p_i^{p_i} · p(x|θ)   [1]
By the ML principle (the weights p_i maximize ∏_i r_i^{p_i} over all probability vectors r), we have:
∏_{i=1}^k (q_i / p_i)^{p_i} ≤ 1
Dividing equation [1] by ∏_{i=1}^k p_i^{p_i}, we get:
p(x|λ*) ≥ ∏_{i=1}^k (q_i / p_i)^{p_i} · p(x|λ*) > p(x|θ)   QED
12 Example: Baum-Welch = EM for HMM. The Baum-Welch algorithm is the EM algorithm for HMMs. E-step for HMM:
L_θ(λ) = ∏_s p(s,x|λ)^{p(s|x,θ)}
where λ are the new parameters {m_kl, e_k(b)}. M-step for HMM: look for λ which maximizes L_θ(λ). Recall that for an HMM,
p(s,x|λ) = ∏_{k,l} m_kl^{M_kl(s)} · ∏_{k,b} e_k(b)^{E_k(b,s)}
13 Baum-Welch = EM for HMM (cont.). Writing p(s,x|λ) as ∏_{k,l} m_kl^{M_kl(s)} · ∏_{k,b} e_k(b)^{E_k(b,s)}, we get
L_θ(λ) = ∏_{k,l} m_kl^{Σ_s p(s|x,θ) M_kl(s)} · ∏_{k,b} e_k(b)^{Σ_s p(s|x,θ) E_k(b,s)} = ∏_{k,l} m_kl^{M_kl} · ∏_{k,b} e_k(b)^{E_k(b)}
where M_kl = Σ_s p(s|x,θ) M_kl(s) and E_k(b) = Σ_s p(s|x,θ) E_k(b,s). As we showed, L_θ(λ) is maximized when the m_kl's and e_k(b)'s are the relative frequencies of the corresponding variables given x and θ, i.e.,
m_kl = M_kl / Σ_{l'} M_kl'  and  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
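Given the expected counts M_kl and E_k(b), the Baum-Welch M-step is just two normalizations. A hypothetical sketch (the function name and data structures are our own; accumulating the expected counts themselves requires the forward-backward algorithm, which is not shown here):

```python
# Hypothetical Baum-Welch M-step, assuming the E-step has already
# produced expected transition counts M[k][l] and expected emission
# counts E[k][b] (e.g. via the forward-backward algorithm).

def baum_welch_m_step(M, E):
    # m_kl = M_kl / sum_l' M_kl'   and   e_k(b) = E_k(b) / sum_b' E_k(b')
    m = {k: {l: c / sum(row.values()) for l, c in row.items()}
         for k, row in M.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()}
         for k, row in E.items()}
    return m, e

# Illustrative counts for a 2-state HMM over the alphabet {'A', 'B'}:
M = {0: {0: 3.0, 1: 1.0}, 1: {0: 2.0, 1: 2.0}}
E = {0: {'A': 4.0, 'B': 1.0}, 1: {'A': 1.0, 'B': 4.0}}
m, e = baum_welch_m_step(M, E)
```

With these counts, state 0's transition row normalizes to (0.75, 0.25) and its emission row to (0.8, 0.2), the relative frequencies described above.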
14 A simple example: EM for 2 coin tosses. Consider the following experiment: given a coin with two possible outcomes, H (head) and T (tail), with probabilities θ_H and θ_T = 1 - θ_H. The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (θ_H, θ_T) = (¼, ¾).
15 EM for 2 coin tosses (cont.). The hidden data which produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (θ_H, θ_T) is
p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = θ_T θ_H + θ_T²
For the initial parameters θ = (¼, ¾), we have: p(x|θ) = ¾·¼ + ¾·¾ = ¾. Note that in this case P(x,y_i|θ) = P(y_i|θ), for i = 1, 2. We can always define y so that (x,y) = y (otherwise we set y' = (x,y) and replace the y's by the y''s).
16 EM for 2 coin tosses - E step. Calculate L_θ(λ) = L_θ(λ_H, λ_T). Recall: λ_H, λ_T are the new parameters, which we need to optimize:
L_θ(λ) = p(x,y_1|λ)^{p(y_1|x,θ)} · p(x,y_2|λ)^{p(y_2|x,θ)}
p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼
p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾
Thus we have L_θ(λ) = p(x,y_1|λ)^{¼} · p(x,y_2|λ)^{¾}
17 EM for 2 coin tosses - E step. For a sequence y of coin tosses, let N_H(y) be the number of H's in y, and N_T(y) the number of T's in y. Then
p(y|λ) = λ_H^{N_H(y)} · λ_T^{N_T(y)}
In our example: y_1 = (T,H), y_2 = (T,T), hence: N_H(y_1) = N_T(y_1) = 1, N_H(y_2) = 0, N_T(y_2) = 2.
18 Example: 2 coin tosses - E step. Thus
p(x,y_1|λ) = λ_T^{N_T(y_1)} λ_H^{N_H(y_1)} = λ_T λ_H,  p(x,y_2|λ) = λ_T^{N_T(y_2)} λ_H^{N_H(y_2)} = λ_T²
L_θ(λ) = p(x,y_1|λ)^{¼} · p(x,y_2|λ)^{¾} = (λ_T λ_H)^{¼} (λ_T²)^{¾} = λ_T^{¼ N_T(y_1) + ¾ N_T(y_2)} λ_H^{¼ N_H(y_1) + ¾ N_H(y_2)}
And in general, with N_T = ¼ N_T(y_1) + ¾ N_T(y_2) = 7/4 and N_H = ¼ N_H(y_1) + ¾ N_H(y_2) = ¼:
L_θ(λ) = λ_T^{N_T} λ_H^{N_H}
19 EM for 2 coin tosses - M step. Find λ* which maximizes L_θ(λ). As we already saw, λ_H^{N_H} λ_T^{N_T} is maximized when:
λ_H = N_H / (N_H + N_T);  λ_T = N_T / (N_H + N_T)
Here: λ_H = (¼)/2 = 1/8;  λ_T = (7/4)/2 = 7/8
That is, λ* = (1/8, 7/8) and p(x|λ*) = (7/8)(1/8) + (7/8)² = 7/8. [The optimal parameters (0,1) will never be reached by the EM algorithm!]
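The E- and M-steps of the coin example can be reproduced numerically. A minimal sketch (the function name and code are our own illustration), which recovers λ_H = 1/8 and also shows why (0,1) is only approached, never reached:

```python
# One EM iteration for the x = (T, *) coin example from the slides.

def coin_em_step(theta_H):
    theta_T = 1.0 - theta_H
    # hidden completions: y1 = (T, H), y2 = (T, T)
    p_x = theta_T * theta_H + theta_T ** 2      # p(x | theta)
    p1 = theta_T * theta_H / p_x                # p(y1 | x, theta)
    p2 = theta_T ** 2 / p_x                     # p(y2 | x, theta)
    N_H = p1 * 1                                # expected number of H's
    N_T = p1 * 1 + p2 * 2                       # expected number of T's
    return N_H / (N_H + N_T)                    # new lambda_H (M-step)

lam_H = coin_em_step(0.25)
assert abs(lam_H - 1 / 8) < 1e-12               # matches the slide: 1/8
# each iteration halves theta_H, so (0, 1) is approached but never reached:
assert abs(coin_em_step(lam_H) - 1 / 16) < 1e-12
```

A short calculation shows why θ_H halves each step: p(y_1|x,θ) = θ_H, so N_H = θ_H and N_T = 2 - θ_H, giving λ_H = θ_H / 2.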
20 EM for a single random variable (dice). Now the probability of each y (= (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities λ_1, ..., λ_m. Let N_k(y) = #(outcome k occurs in y). Then
p(y|λ) = ∏_{k=1}^m λ_k^{N_k(y)}
Let N_k be the expected value of N_k(y), given x and θ:
N_k = E(N_k(y)|x,θ) = Σ_y p(y|x,θ) N_k(y)
Then we have:
21 L_θ(λ) for one dice.
L_θ(λ) = ∏_y p(y|λ)^{p(y|x,θ)} = ∏_y ∏_{k=1}^m λ_k^{N_k(y) p(y|x,θ)} = ∏_{k=1}^m λ_k^{N_k}
which is maximized for λ_k = N_k / Σ_{k'} N_{k'}
22 EM algorithm for n independent observations x_1, ..., x_n: Expectation step. It can be shown that, if the x_j are independent, then:
N_k = Σ_{j=1}^n Σ_{y_j} p(y_j|x_j,θ) N_k(y_j,x_j) = Σ_{j=1}^n (1/p(x_j|θ)) Σ_{y_j} p(y_j,x_j|θ) N_k(y_j,x_j)
23 Example: The ABO locus. A locus is a particular place on the chromosome. Each locus state (called a genotype) consists of two alleles: one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportion in a population of the 6 genotypes. Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE is given by:
q_{a/a} = N_{a/a}/N, q_{a/o} = N_{a/o}/N, q_{b/o} = N_{b/o}/N, q_{b/b} = N_{b/b}/N, q_{a/b} = N_{a/b}/N, q_{o/o} = N_{o/o}/N
24 The ABO locus (cont.). However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
25 The ABO locus (cont.). The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2 q_a q_b, q_{a/o} = 2 q_a q_o, q_{b/o} = 2 q_b q_o, q_{a/a} = q_a², q_{b/b} = q_b², q_{o/o} = q_o². In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y:
26 The ABO locus (cont.). The dice outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, which form an ordered genotype pair; this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a). So we have three parameters of one dice, q_a, q_b, q_o, that we need to estimate. We start with parameters θ = (q_a, q_b, q_o), and then use EM to improve them.
27 EM setting for the ABO locus. The observed data x = (x_1, ..., x_n) is a sequence of elements (blood types) from the set {A, B, AB, O}. E.g., (B,A,B,B,O,A,B,A,O,B,AB) are the observations (x_1, ..., x_11). The hidden data (i.e., the y's) for each x_j is the set of ordered pairs of alleles that generate it. For instance, for A it is the set {aa, ao, oa}. The parameters θ = {q_a, q_b, q_o} are the probabilities of the alleles.
28 EM for ABO loci. For each observed blood type x_j ∈ {A, B, AB, O} and for each allele z in {a, b, o} we compute N_z(x_j), the expected number of times that z appears in x_j:
N_z(x_j) = Σ_{y_j} p(y_j|x_j,θ) N_z(y_j)
where the sum is taken over the ordered genotype pairs y_j, and N_z(y_j) is the number of times allele z occurs in the pair y_j. E.g., N_a(o,b) = 0; N_b(o,b) = N_o(o,b) = 1.
29 EM for ABO loci. The computation for blood type B:
P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2 q_b q_o
Since N_b((b,b)) = 2, and N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1, the expected numbers of occurrences of o and b in B, N_o(B) and N_b(B), are given by:
N_o(B) = Σ_y p(y|B,θ) N_o(y) = 2 q_b q_o / (q_b² + 2 q_b q_o)
N_b(B) = Σ_y p(y|B,θ) N_b(y) = (2 q_b² + 2 q_b q_o) / (q_b² + 2 q_b q_o)
Observe that N_b(B) + N_o(B) = 2.
30 EM for ABO loci. Similarly, P(A|θ) = q_a² + 2 q_a q_o, and
N_o(A) = 2 q_a q_o / (q_a² + 2 q_a q_o),  N_a(A) = (2 q_a² + 2 q_a q_o) / (q_a² + 2 q_a q_o)
P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2 q_a q_b;  N_a(AB) = N_b(AB) = 1
P(O|θ) = P((o,o)|θ) = q_o²;  N_o(O) = 2
[N_b(O) = N_a(O) = N_o(AB) = N_b(A) = N_a(B) = 0]
31 E step: compute N_a, N_b and N_o. Let #(A) = 3, #(B) = 5, #(AB) = 1, #(O) = 2 be the numbers of observations of A, B, AB, and O respectively (so N = 11).
N_a = #(A)·N_a(A) + #(AB)·N_a(AB)
N_b = #(B)·N_b(B) + #(AB)·N_b(AB)
N_o = #(A)·N_o(A) + #(B)·N_o(B) + #(O)·N_o(O)
Note that N_a + N_b + N_o = 2N = 22.
M step: set λ* = (q_a*, q_b*, q_o*):
q_a* = N_a / (2N);  q_b* = N_b / (2N);  q_o* = N_o / (2N)
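The E- and M-steps above can be run numerically. A sketch (our own code, using the slide's counts and the allele-count formulas from the previous slides):

```python
# One EM iteration for the ABO example, with the observed counts from
# this slide: #(A)=3, #(B)=5, #(AB)=1, #(O)=2 (so N = 11).

def abo_em_step(qa, qb, qo, nA=3, nB=5, nAB=1, nO=2):
    n = nA + nB + nAB + nO
    # E-step: expected allele counts (formulas from the previous slides)
    pA = qa * qa + 2 * qa * qo                  # P(A | theta)
    pB = qb * qb + 2 * qb * qo                  # P(B | theta)
    Na = nA * (2 * qa * qa + 2 * qa * qo) / pA + nAB * 1
    Nb = nB * (2 * qb * qb + 2 * qb * qo) / pB + nAB * 1
    No = nA * (2 * qa * qo) / pA + nB * (2 * qb * qo) / pB + nO * 2
    assert abs(Na + Nb + No - 2 * n) < 1e-9     # each person has 2 alleles
    # M-step: normalize the expected counts
    return Na / (2 * n), Nb / (2 * n), No / (2 * n)

qa, qb, qo = 1 / 3, 1 / 3, 1 / 3                # initial guess (ours)
for _ in range(50):                             # iterate toward a fixed point
    qa, qb, qo = abo_em_step(qa, qb, qo)
assert abs(qa + qb + qo - 1) < 1e-9
```

Starting from the uniform guess, the first iteration gives q_a* = 5/22, and the allele frequencies stay a valid probability vector at every step.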
32 EM for a general discrete stochastic process. Now we wish to maximize the likelihood of observation x with hidden data as before, i.e., maximize p(x|λ) = Σ_y p(x,y|λ). But this time the experiment (x,y) is generated by a general stochastic process. The only assumption we make is that the outcome of each experiment consists of a (finite) sequence of samplings of r discrete random variables (dices) Z_1, ..., Z_r; each of the Z_i's can be sampled a few times. This can be realized by a probabilistic acyclic state machine, where at each state some Z_i is sampled, and the next state is determined by the outcome, until a final state is reached.
33 EM for processes with many dices. Example: in an HMM, the random variables are the transition and emission probabilities a_kl, e_k(b). x is the visible information, y is the sequence s of states, and (x,y) is the complete HMM sequence. As before, we can redefine y so that (x,y) = y. (Figure: an HMM with hidden states s_1, s_2, ..., s_{L-1}, s_L emitting X_1, X_2, ..., X_{L-1}, X_L.)
34 EM for processes with many dices. Each random variable Z_l (l = 1, ..., r) has m_l values z_{l,1}, ..., z_{l,m_l} with probabilities {q_{lk} | k = 1, ..., m_l}. Each y defines a sequence of outcomes (z_{l_1,k_1}, ..., z_{l_n,k_n}) of the random variables used in y. (In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence y_j.) Let N_{lk}(y) = #(z_{l,k} appears in y).
35 EM for processes with many dices. Similarly to the single dice case, we have:
p(y|λ) = ∏_{l=1}^r ∏_{k=1}^{m_l} λ_{lk}^{N_{lk}(y)}
Define N_{lk} as the expected value of N_{lk}(y), given x and θ:
N_{lk} = E(N_{lk}(y)|x,θ) = Σ_y p(y|x,θ) N_{lk}(y)
Then we have:
36 L_θ(λ) for processes with many dices.
L_θ(λ) = ∏_y p(y|λ)^{p(y|x,θ)} = ∏_y ∏_{l=1}^r ∏_{k=1}^{m_l} λ_{lk}^{N_{lk}(y) p(y|x,θ)} = ∏_{l=1}^r ∏_{k=1}^{m_l} λ_{lk}^{N_{lk}}
where N_{lk} = Σ_y p(y|x,θ) N_{lk}(y) is the expected number of times that, given x and θ, the outcome of dice l was k. L_θ(λ) is maximized for
λ_{lk} = N_{lk} / Σ_{k'} N_{lk'}
37 EM algorithm for processes with many dices. Similarly to the one dice case we get:
Expectation step: set N_{lk} to E(N_{lk}(y)|x,θ), i.e., N_{lk} = Σ_y p(y|x,θ) N_{lk}(y)
Maximization step: set λ_{lk} = N_{lk} / Σ_{k'} N_{lk'}
38 EM algorithm for n independent observations x_1, ..., x_n: Expectation step. It can be shown that, if the x_j are independent, then:
N_{lk} = Σ_{j=1}^n Σ_{y_j} p(y_j|x_j,θ) N_{lk}(y_j,x_j) = Σ_{j=1}^n (1/p(x_j|θ)) Σ_{y_j} p(y_j,x_j|θ) N_{lk}(y_j,x_j)
39 EM in Practice. Initial parameters: random parameter settings, or a best guess from another source. Stopping criteria: small change in the likelihood of the data; small change in the parameter values. Avoiding bad local maxima: multiple restarts; early pruning of unpromising runs.
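The stopping criterion and restart advice can be illustrated with a toy driver reusing the two-coin update from this lecture (a sketch: `em_update`, `run_em`, the tolerance, and the number of restarts are our own choices; in this particular toy every start converges to the same point, so the restart loop only demonstrates the mechanics, not a multimodal likelihood):

```python
# Toy EM driver: stop on small parameter change, try several restarts,
# keep the run with the best observed-data likelihood.
import random

def em_update(theta_H):
    # one EM step for the x = (T, *) coin example: theta_H halves each step
    return theta_H / 2.0

def run_em(theta_H, tol=1e-6, max_iter=1000):
    for _ in range(max_iter):
        new = em_update(theta_H)
        if abs(new - theta_H) < tol:    # stopping criterion: small change
            break
        theta_H = new
    return theta_H

def likelihood(theta_H):
    # p(x | theta) for x = (T, *): theta_T * theta_H + theta_T^2 = theta_T
    t = 1.0 - theta_H
    return t * theta_H + t * t

random.seed(0)
starts = [random.random() for _ in range(5)]     # multiple random restarts
best = max((run_em(s) for s in starts), key=likelihood)
```

Since p(x|θ) = θ_T here, every restart drives θ_H toward (but never to) 0, consistent with the earlier remark that (0,1) is never reached.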
More informationSTAT 430/510 Probability Lecture 7: Random Variable and Expectation
STAT 430/510 Probability Lecture 7: Random Variable and Expectation Pengyuan (Penelope) Wang June 2, 2011 Review Properties of Probability Conditional Probability The Law of Total Probability Bayes Formula
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 6: Hidden Markov Models (Part II) Instructor: Preethi Jyothi Aug 10, 2017 Recall: Computing Likelihood Problem 1 (Likelihood): Given an HMM l =(A, B) and an
More informationLECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50
LECTURE #10 A. The Hardy-Weinberg Equilibrium 1. From the definitions of p and q, and of p 2, 2pq, and q 2, an equilibrium is indicated (p + q) 2 = p 2 + 2pq + q 2 : if p and q remain constant, and if
More informationMACHINE LEARNING 2 UGM,HMMS Lecture 7
LOREM I P S U M Royal Institute of Technology MACHINE LEARNING 2 UGM,HMMS Lecture 7 THIS LECTURE DGM semantics UGM De-noising HMMs Applications (interesting probabilities) DP for generation probability
More information02 Background Minimum background on probability. Random process
0 Background 0.03 Minimum background on probability Random processes Probability Conditional probability Bayes theorem Random variables Sampling and estimation Variance, covariance and correlation Probability
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationEM (cont.) November 26 th, Carlos Guestrin 1
EM (cont.) Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 26 th, 2007 1 Silly Example Let events be grades in a class w 1 = Gets an A P(A) = ½ w 2 = Gets a B P(B) = µ
More informationLECTURE 1. 1 Introduction. 1.1 Sample spaces and events
LECTURE 1 1 Introduction The first part of our adventure is a highly selective review of probability theory, focusing especially on things that are most useful in statistics. 1.1 Sample spaces and events
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationLecture #5. Dependencies along the genome
Markov Chains Lecture #5 Background Readings: Durbin et. al. Section 3., Polanski&Kimmel Section 2.8. Prepared by Shlomo Moran, based on Danny Geiger s and Nir Friedman s. Dependencies along the genome
More informationWhat is a random variable
OKAN UNIVERSITY FACULTY OF ENGINEERING AND ARCHITECTURE MATH 256 Probability and Random Processes 04 Random Variables Fall 20 Yrd. Doç. Dr. Didem Kivanc Tureli didemk@ieee.org didem.kivanc@okan.edu.tr
More informationECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4
ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian
More informationDistribusi Binomial, Poisson, dan Hipergeometrik
Distribusi Binomial, Poisson, dan Hipergeometrik CHAPTER TOPICS The Probability of a Discrete Random Variable Covariance and Its Applications in Finance Binomial Distribution Poisson Distribution Hypergeometric
More informationPhasing via the Expectation Maximization (EM) Algorithm
Computing Haplotype Frequencies and Haplotype Phasing via the Expectation Maximization (EM) Algorithm Department of Computer Science Brown University, Providence sorin@cs.brown.edu September 14, 2010 Outline
More informationAllele Frequency Estimation
Allele Frequency Estimation Examle: ABO blood tyes ABO genetic locus exhibits three alleles: A, B, and O Four henotyes: A, B, AB, and O Genotye A/A A/O A/B B/B B/O O/O Phenotye A A AB B B O Data: Observed
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationLOCUS. Definition: The set of all points (and only those points) which satisfy the given geometrical condition(s) (or properties) is called a locus.
LOCUS Definition: The set of all points (and only those points) which satisfy the given geometrical condition(s) (or properties) is called a locus. Eg. The set of points in a plane which are at a constant
More informationData Structures and Algorithm Analysis (CSC317) Randomized algorithms
Data Structures and Algorithm Analysis (CSC317) Randomized algorithms Hiring problem We always want the best hire for a job! Using employment agency to send one candidate at a time Each day, we interview
More informationLecture 6: Entropy Rate
Lecture 6: Entropy Rate Entropy rate H(X) Random walk on graph Dr. Yao Xie, ECE587, Information Theory, Duke University Coin tossing versus poker Toss a fair coin and see and sequence Head, Tail, Tail,
More informationVL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16
VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationGoodness of Fit Goodness of fit - 2 classes
Goodness of Fit Goodness of fit - 2 classes A B 78 22 Do these data correspond reasonably to the proportions 3:1? We previously discussed options for testing p A = 0.75! Exact p-value Exact confidence
More information26. LECTURE 26. Objectives
6. LECTURE 6 Objectives I understand the idea behind the Method of Lagrange Multipliers. I can use the method of Lagrange Multipliers to maximize a multivariate function subject to a constraint. Suppose
More informationK-means. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. November 19 th, Carlos Guestrin 1
EM Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 19 th, 2007 2005-2007 Carlos Guestrin 1 K-means 1. Ask user how many clusters they d like. e.g. k=5 2. Randomly guess
More informationProbability and Independence Terri Bittner, Ph.D.
Probability and Independence Terri Bittner, Ph.D. The concept of independence is often confusing for students. This brief paper will cover the basics, and will explain the difference between independent
More informationExample: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding
Example: The Dishonest Casino Hidden Markov Models Durbin and Eddy, chapter 3 Game:. You bet $. You roll 3. Casino player rolls 4. Highest number wins $ The casino has two dice: Fair die P() = P() = P(3)
More informationMATH MW Elementary Probability Course Notes Part I: Models and Counting
MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics
More informationLearning Bayesian Networks (part 1) Goals for the lecture
Learning Bayesian Networks (part 1) Mark Craven and David Page Computer Scices 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some ohe slides in these lectures have been adapted/borrowed from materials
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a
More informationQualifier: CS 6375 Machine Learning Spring 2015
Qualifier: CS 6375 Machine Learning Spring 2015 The exam is closed book. You are allowed to use two double-sided cheat sheets and a calculator. If you run out of room for an answer, use an additional sheet
More information