CM229S: Machine Learning for Bioinformatics    Lecture 12 - 05/05/2016

Hidden Markov Models

Lecturer: Sriram Sankararaman    Scribe: Akshay Dattatray Shinde    Edited by: TBD

1 Introduction

For a directed graph G we can write conditional independence relationships of the form X_A ⊥ X_B | X_C. We are looking for factorizations of the type

    P(X_A, X_B | X_C) = P(X_A | X_C) · P(X_B | X_C)

The graph G defines a set of conditional independencies I(G), which can be read off via d-separation (the Bayes Ball algorithm); a distribution P defines a set I(P) via its factorization. If for a given distribution P we have I(G) ⊆ I(P), then P is Markov w.r.t. G. A fully connected graph encodes no conditional independencies. Also, if P is Markov w.r.t. G, then we can write

    P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_{pa(i)})

i.e. each X_i is independent of its non-descendants (other than its parents) given its parents X_{pa(i)}.

Markov Blanket: the Markov blanket of a node is the only knowledge needed to predict the behavior of that node.

Figure 1: Markov Blanket

    P(x_i | x_{mb(i)}) = P(x_i | x_{-i})
mb(5) = {3, 2, 6}

The Markov blanket of a node includes its parents, children and co-parents:

    mb(i) = pa(i) ∪ children(i) ∪ co-parents(i)

This relation is important because once we know the Markov blanket of a node, we can ignore the rest of the nodes. In the directed graph above,

    P(x_5 | x_{-5}) = P(x_5, x_{-5}) / P(x_{-5}) ∝ P(x_5 | x_3) · P(x_6 | x_2, x_5)

We are primarily interested in doing inference on Bayesian networks, i.e. calculating a posterior probability P(x_h | x_v) for a given set of observed (visible) variables v.

2 Exact Inference

A particular class of inference algorithms we will be focusing on is called Sum-Product or Belief Propagation (exact on chains and trees); in the particular case of a chain, it is the Forwards-Backwards algorithm.

2.1 Chain Structured Graph

We will use this simplest case to see how the algorithm works. This is a first-order Markov chain. The distribution corresponding to this Markov chain is

    P(z_{1:m}) = P(z_1) ∏_{i=2}^{m} P(z_i | z_{i-1})

where each z_i takes one of K states, P(z_1) is a size-K vector, and each transition P(z_i | z_{i-1}) is a K × K matrix, so each factor has O(K^2) entries. For a chain, mb(j) = {j - 1, j + 1}.

To describe a concrete example of this, we will look at admixture models.

2.2 Hidden Markov Model (extension of the chain structure)
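As a sanity check of the Markov blanket property mb(j) = {j - 1, j + 1} on a chain, the sketch below conditions z_3 once on all other variables and once on only its blanket, by brute-force enumeration. All probability tables here are made up for illustration; they are not from the lecture.

```python
# Sketch: verifying mb(3) = {2, 4} on a 5-node binary Markov chain.
# The tables pi and A are illustrative, not from the lecture.
import itertools
import numpy as np

K = 2                                # binary states
pi = np.array([0.6, 0.4])            # P(z1)
A = np.array([[0.7, 0.3],            # A[i, j] = P(z_t = j | z_{t-1} = i)
              [0.2, 0.8]])

def joint(z):
    """P(z_1..z_5) = P(z1) * prod_i P(z_i | z_{i-1})."""
    p = pi[z[0]]
    for prev, cur in zip(z, z[1:]):
        p *= A[prev, cur]
    return p

def cond_z3(given):
    """P(z3 | the assignments in `given`), by brute-force enumeration.

    `given` maps 0-indexed positions to fixed values; position 2 (= z3) is free."""
    num = np.zeros(K)
    for z in itertools.product(range(K), repeat=5):
        if all(z[pos] == val for pos, val in given.items()):
            num[z[2]] += joint(z)
    return num / num.sum()

# Condition on everything vs. only on the Markov blanket {z2, z4}:
full = cond_z3({0: 1, 1: 0, 3: 1, 4: 0})
blanket = cond_z3({1: 0, 3: 1})
assert np.allclose(full, blanket)    # z3 is independent of {z1, z5} given {z2, z4}
```

Once z_2 and z_4 are fixed, the values of z_1 and z_5 carry no further information about z_3, which is exactly the statement P(x_i | x_{mb(i)}) = P(x_i | x_{-i}).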
    P(x_{1:m}, z_{1:m}) = P(z_{1:m}) · P(x_{1:m} | z_{1:m})
                        = P(z_1) ∏_{i=2}^{m} P(z_i | z_{i-1}) · ∏_{i=1}^{m} P(x_i | z_i)

where the factors P(z_i | z_{i-1}) are the transition probabilities and the factors P(x_i | z_i) are the emission (observation) probabilities.

A Markov chain directly on the observations captures only local dependence, which is not what we want here. In the HMM, marginalizing over the hidden states induces long-range dependence among the observations. An advantage of working with this model is that we don't have to maintain m separate models.

Admixture:

Figure 2: Admixture Model

2.3 Inference Problems on HMMs

2.3.1 Filtering

We receive data one point at a time and want to infer the hidden variable at that point: P(z_t | x_{1:t}). This method is online.

2.3.2 Smoothing

We have observed all the data and want to infer the hidden variables: P(z_t | x_{1:m}). This is a batch prediction task.

2.3.3 Prediction

Given data up to some point, what will the next state look like: P(z_{t+1} | x_{1:t}).

2.3.4 MAP estimation

Calculate z*_{1:m} = argmax_{z_{1:m}} P(z_{1:m} | x_{1:m})
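The transition/emission factorization above can be evaluated directly for any assignment (z, x). A minimal sketch, with illustrative (made-up) tables π, A, B:

```python
# Sketch: evaluating the HMM joint P(x_{1:m}, z_{1:m}) via its factorization
# into transition and emission terms. All tables are illustrative.
import numpy as np

pi = np.array([0.5, 0.5])                 # P(z1)
A = np.array([[0.9, 0.1], [0.1, 0.9]])    # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # B[z, x] = P(x_t = x | z_t = z)

def log_joint(z, x):
    """log P(z_{1:m}) + log P(x_{1:m} | z_{1:m}) for one assignment."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for i in range(1, len(z)):
        lp += np.log(A[z[i - 1], z[i]])   # transition term P(z_i | z_{i-1})
        lp += np.log(B[z[i], x[i]])       # emission term P(x_i | z_i)
    return lp

print(np.exp(log_joint([0, 0, 1], [0, 1, 1])))
```

Working in log space avoids the underflow that the raw product of m small probabilities would cause for long sequences.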
2.3.5 Marginal Likelihood / Evidence

Calculate P(x_{1:t}).

Let's look at one concrete example. To marginalize a three-variable chain,

    P(z_1) = Σ_{z_2} Σ_{z_3} P(z_1, z_2, z_3)

which costs O(K^3) by brute force, but pushing the sums inward,

    P(z_1) = P(z_1) Σ_{z_2} P(z_2 | z_1) Σ_{z_3} P(z_3 | z_2)

each sum costs only O(K^2).

Now we introduce observations. This is a smoothing problem:

    P(z_1 | x_{1:3}) = Σ_{z_2} Σ_{z_3} P(z_1, z_2, z_3 | x_{1:3})

Using P(z_{1:3} | x_{1:3}) = P(x_{1:3} | z_{1:3}) P(z_{1:3}) / P(x_{1:3}), and ignoring the denominator as it is just a normalization constant:

    P(z_1 | x_{1:3}) ∝ Σ_{z_2} Σ_{z_3} P(z_1) P(x_1 | z_1) P(z_2 | z_1) P(x_2 | z_2) P(z_3 | z_2) P(x_3 | z_3)
                     = Σ_{z_2} P(z_1) P(x_1 | z_1) P(z_2 | z_1) P(x_2 | z_2) Σ_{z_3} P(z_3 | z_2) P(x_3 | z_3)
                     = Σ_{z_2} P(z_1) P(x_1 | z_1) P(z_2 | z_1) P(x_2 | z_2) M_{3→2}(z_2)
                     = M_{2→1}(z_1) P(z_1) P(x_1 | z_1)

where M_{3→2}(z_2) = Σ_{z_3} P(z_3 | z_2) P(x_3 | z_3) and M_{2→1}(z_1) = Σ_{z_2} P(z_2 | z_1) P(x_2 | z_2) M_{3→2}(z_2) are the messages.

Let's apply these ideas to the HMM.
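The elimination order above (sum out z_3 first, then z_2) can be checked numerically against brute-force enumeration. A small sketch with illustrative (made-up) tables:

```python
# Sketch: variable elimination vs. brute force for P(z1 | x_{1:3}).
# Tables pi, A, B and the observed sequence x are illustrative.
import itertools
import numpy as np

K = 2
pi = np.array([0.5, 0.5])                 # P(z1)
A = np.array([[0.9, 0.1], [0.1, 0.9]])    # P(z_i = j | z_{i-1} = i)
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # P(x_i | z_i)
x = [0, 1, 1]                             # observed sequence

# Brute force: enumerate all K^3 assignments.
brute = np.zeros(K)
for z1, z2, z3 in itertools.product(range(K), repeat=3):
    brute[z1] += (pi[z1] * B[z1, x[0]] * A[z1, z2] * B[z2, x[1]]
                  * A[z2, z3] * B[z3, x[2]])
brute /= brute.sum()

# Elimination: push sums inward; each message is an O(K^2) matrix-vector product.
M32 = A @ B[:, x[2]]                  # M_{3->2}(z2) = sum_z3 P(z3|z2) P(x3|z3)
M21 = A @ (B[:, x[1]] * M32)          # M_{2->1}(z1) = sum_z2 P(z2|z1) P(x2|z2) M_{3->2}(z2)
elim = pi * B[:, x[0]] * M21
elim /= elim.sum()                    # normalization replaces the ignored denominator

assert np.allclose(brute, elim)
```

Each message is just a matrix-vector product, which is where the O(K^2) per-sum cost comes from.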
Forwards-Backwards Algorithm: the goal is to compute the smoothing probability for every instant t.

    γ_t(j) = P(z_t = j | x_{1:m})
           ∝ P(x_{1:m} | z_t = j) P(z_t = j)
           = P(x_{1:t} | z_t = j) P(x_{t+1:m} | z_t = j) P(z_t = j)
           ∝ P(z_t = j | x_{1:t}) · P(x_{t+1:m} | z_t = j)

The first factor, α_t(j) ≜ P(z_t = j | x_{1:t}), is computed in the forward pass; the second, β_t(j) ≜ P(x_{t+1:m} | z_t = j), in the backward pass.

We want to write α_t as a function of α_{t-1}, which will allow us to use dynamic programming:

    α_t(j) = P(x_t | x_{1:t-1}, z_t = j) P(z_t = j | x_{1:t-1}) / P(x_t | x_{1:t-1})
           ∝ P(x_t | z_t = j) Σ_i P(z_t = j, z_{t-1} = i | x_{1:t-1})
           = P(x_t | z_t = j) Σ_i P(z_t = j | z_{t-1} = i, x_{1:t-1}) P(z_{t-1} = i | x_{1:t-1})

The last factor is α_{t-1}(i). Writing ψ_t(j) = P(x_t | z_t = j) and ψ_{t-1,t}(i, j) = P(z_t = j | z_{t-1} = i),

    α_t(j) ∝ ψ_t(j) Σ_i ψ_{t-1,t}(i, j) α_{t-1}(i)

The sum is the message M_{t-1→t}(j), and each step costs O(K^2).

Initialization: α_1(j) = P(z_1 = j | x_1) ∝ P(x_1 | z_1 = j) P(z_1 = j) = ψ_1(j) ψ_{0,1}(j).

For the backward pass, with β_t(j) ≜ P(x_{t+1:m} | z_t = j):

    β_{t-1}(j) = P(x_{t:m} | z_{t-1} = j)
               = Σ_k P(x_{t:m}, z_t = k | z_{t-1} = j)
               = Σ_k P(x_{t:m} | z_t = k, z_{t-1} = j) P(z_t = k | z_{t-1} = j)
               = Σ_k P(x_t | z_t = k) P(x_{t+1:m} | z_t = k) P(z_t = k | z_{t-1} = j)
               = Σ_k β_t(k) ψ_t(k) ψ_{t-1,t}(j, k)

which again costs O(K^2) per step.
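The two recursions above translate directly into code. A minimal sketch, with per-step normalization of α and β for numerical stability; the tables π, A, B are illustrative, not from the lecture:

```python
# Sketch of the forward-backward recursions with illustrative tables.
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])    # psi_{t-1,t}(i, j) = P(z_t=j | z_{t-1}=i)
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # psi_t(j) = P(x_t | z_t=j), columns = symbols

def forward_backward(x):
    m, K = len(x), len(pi)
    alpha = np.zeros((m, K))
    alpha[0] = pi * B[:, x[0]]            # alpha_1(j) ∝ psi_1(j) psi_{0,1}(j)
    alpha[0] /= alpha[0].sum()
    for t in range(1, m):
        # alpha_t(j) ∝ psi_t(j) * sum_i psi_{t-1,t}(i, j) alpha_{t-1}(i)
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()

    beta = np.ones((m, K))                # initialization: beta_m(j) = 1
    for t in range(m - 2, -1, -1):
        # beta_t(j) = sum_k beta_{t+1}(k) psi_{t+1}(k) psi_{t,t+1}(j, k)
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()          # rescaling; gamma is normalized below anyway

    gamma = alpha * beta                  # gamma_t(j) ∝ alpha_t(j) beta_t(j)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

gamma = forward_backward([0, 0, 1, 1])
print(gamma)   # row t holds the smoothed posterior P(z_t | x_{1:m}); rows sum to 1
```

Each update is one K × K matrix-vector product, so the whole pass over m observations matches the O(mK^2) total cost derived above.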
Initialization: β_m(j) = 1. The total time complexity is O(mK^2), and

    γ_t(j) ∝ α_t(j) β_t(j)

Another problem that can be solved this way is MAP estimation (the Viterbi algorithm), which is similar to the forwards-backwards algorithm. We want

    z* = argmax_{z_{1:m}} P(z_{1:m} | x_{1:m})

Partial computation:

    δ_t(j) = max_{z_{1:t-1}} P(z_{1:t-1}, z_t = j | x_{1:t})
           = max_{z_{1:t-2}, i} P(z_{1:t-2}, z_{t-1} = i, z_t = j | x_{1:t-1}, x_t)
           ∝ max P(x_{1:t-1}, x_t | z_t = j, z_{t-1} = i, z_{1:t-2}) P(z_t = j, z_{t-1} = i, z_{1:t-2})
           = max { P(x_t | z_t = j) P(x_{1:t-1} | z_{t-1} = i, z_{1:t-2}) P(z_t = j | z_{t-1} = i) P(z_{t-1} = i, z_{1:t-2}) }
           = max_i { P(z_{t-1} = i, z_{1:t-2} | x_{1:t-1}) P(x_t | z_t = j) P(z_t = j | z_{t-1} = i) }

Maximizing the first factor over z_{1:t-2} gives δ_{t-1}(i), so

    δ_t(j) = max_i δ_{t-1}(i) ψ_t(j) ψ_{t-1,t}(i, j)

again at O(K^2) per step.

First observation: δ_1(j) = P(z_1 = j | x_1). Last observation: max_j δ_m(j) = max P(z_{1:m} | x_{1:m}), and the MAP path z* is recovered by backtracking the argmaxes.

Learning / Parameter Estimation:

Observed data: {x^(1), ..., x^(n)}, parameters θ = (π, A, B)

    ll(θ) = Σ_{i=1}^{n} log P(x^(i) | θ)

    θ̂ = argmax_θ ll(θ)

E-M Algorithm (Baum-Welch): the E-step involves computing the posterior probabilities P(z_t | x_{1:m}, θ^(t)) and the posterior probabilities at two adjacent instants in time, P(z_{t-1}, z_t | x_{1:m}, θ^(t)), which we can calculate easily given α and β. The M-step re-estimates the parameters from this soft classification. This is a non-convex problem.
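The Viterbi recursion is the forward recursion with the sum replaced by a max, plus backpointers to recover the argmax path. A sketch in log space, with the same illustrative tables as before (not from the lecture):

```python
# Sketch of the Viterbi recursion in log space, with backtracking.
# Tables pi, A, B are illustrative.
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])    # P(z_t = j | z_{t-1} = i)
B = np.array([[0.8, 0.2], [0.3, 0.7]])    # P(x_t | z_t)

def viterbi(x):
    m, K = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((m, K))
    back = np.zeros((m, K), dtype=int)        # back[t, j] = best predecessor of z_t = j
    delta[0] = np.log(pi) + logB[:, x[0]]     # delta_1(j) ∝ P(z_1 = j) P(x_1 | z_1 = j)
    for t in range(1, m):
        # delta_t(j) = max_i delta_{t-1}(i) psi_{t-1,t}(i, j) psi_t(j)
        scores = delta[t - 1][:, None] + logA  # scores[i, j]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, x[t]]
    # Backtrack from argmax_j delta_m(j) to recover z*_{1:m}.
    path = [int(delta[-1].argmax())]
    for t in range(m - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))  # → [1, 1, 1, 1]
```

With these sticky transitions (0.9 on the diagonal), staying in state 1 throughout beats switching mid-sequence even though the first two symbols favor state 0's emissions, illustrating that the MAP path can differ from the per-step argmax of γ_t.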
One special case of the HMM (useful for imputation) is the Factorial HMM:

Figure 3: Factorial HMM

    z_j^(t) ∈ {0, 1},    X = Z^(1) + Z^(2)

Initially both Markov chains are independent, but they become dependent after observing x. Exact inference fails (becomes intractable) in this case.

3 Conclusion

Hidden Markov models are generative models, in which the joint distribution of observations and hidden states, or equivalently both the prior distribution of hidden states (the transition probabilities) and the conditional distribution of observations given states (the emission probabilities), are modeled. HMMs are useful wherever such a sequential process fits the data, since exact inference and decoding can then be carried out efficiently to achieve better performance.