Chap. 3 Markov chains and hidden Markov models (2)
Biointelligence Laboratory, School of Computer Sci. & Eng., Seoul National University, Seoul 151-742, Korea
This slide file is available online at http://bi.snu.ac.kr/
Copyright (c) 2002 by SNU CSE Biointelligence Lab
Specifying the Model (HMM)
  The design of the structure: what states there are and how they are connected.
  The assignment of parameter values: the transition and emission probabilities $a_{kl}$ and $e_k(b)$.
The Framework for Parameter Estimation
  We have a set of training sequences that we want the model to fit well.
  Given: the independent training sequences $x^1, \ldots, x^n$.
  The log probability of these sequences given the model and its parameters is the log likelihood of the model:
  $l(x^1, \ldots, x^n \mid \theta) = \log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)$
  The parameter value $\theta$ which maximizes this log likelihood is chosen.
Estimation when the State Sequence is Known
  It is easier to estimate the probability parameters when the state paths are known for all the training examples.
  Examples:
    A set of genomic sequences in which the CpG islands were already labeled based on experimental data.
    An HMM for the prediction of secondary structure of proteins, with training sequences obtained from the set of proteins with known structures.
    An HMM predicting genes from genomic sequences, where the transcript structure has been determined by cDNA sequencing.
The Maximum Likelihood Estimator
  Count the number of times each transition and each emission is used in the training sequences: $A_{kl}$ and $E_k(b)$.
  The maximum likelihood estimators:
  $a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$
  In the case of insufficient data the estimates overfit, so pseudocounts are added:
    $A_{kl}$ = number of transitions $k$ to $l$ in training data + $r_{kl}$
    $E_k(b)$ = number of emissions of $b$ from $k$ in training data + $r_k(b)$
  The pseudocounts encode prior knowledge from users. Cf. Dirichlet priors in Bayesian statistics.
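The counting estimator with pseudocounts can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `estimate_parameters`, the dictionary layout, and the single shared `pseudocount` value (standing in for the separate r values) are all assumptions.

```python
def estimate_parameters(sequences, paths, states, alphabet, pseudocount=1.0):
    """Estimate a_kl and e_k(b) from sequences whose state paths are known."""
    # Start every count at the pseudocount (the role of r_kl and r_k(b)).
    A = {k: {l: pseudocount for l in states} for k in states}
    E = {k: {b: pseudocount for b in alphabet} for k in states}
    for x, pi in zip(sequences, paths):
        for i, (symbol, state) in enumerate(zip(x, pi)):
            E[state][symbol] += 1              # emission of symbol from state
            if i + 1 < len(pi):
                A[state][pi[i + 1]] += 1       # transition state -> next state
    # Normalize each row of counts into probabilities.
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e


# Hypothetical casino-style data: F = fair die, L = loaded die.
a, e = estimate_parameters(["1266"], ["FFLL"], states="FL", alphabet="123456")
```

With a pseudocount of zero a state that never occurs in the training paths would give a division by zero, which is one practical reason to keep the pseudocounts positive.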
Estimation when Paths are Unknown
  All the standard algorithms for optimization of continuous functions can be used.
  The Baum-Welch algorithm (informal description):
    1. Estimate the $A_{kl}$ and $E_k(b)$ by considering probable paths for the training sequences using the current values of $a_{kl}$ and $e_k(b)$.
    2. The estimated $A_{kl}$ and $E_k(b)$ are used to update the $a_{kl}$ and $e_k(b)$.
    3. The above process is iterated until some stopping criterion is reached.
  The overall log likelihood of the model is increased by each iteration, so the process converges to a local maximum.
  There exist many local maxima. The result strongly depends on the starting point of the algorithm.
Estimation of the $A_{kl}$ Value
  The probability that $a_{kl}$ is used at position $i$ in sequence $x$:
  $P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$
Estimation of the $E_k(b)$ Value
  The probability that $e_k(b)$ is used at position $i$ in sequence $x$:
  $P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x)}$
  Only the case where $x_i = b$ is considered.
Estimation of $A_{kl}$ and $E_k(b)$ Values
  The Baum-Welch algorithm calculates $A_{kl}$ and $E_k(b)$ as the expected number of times each transition or emission is used, given the training sequences:
  $A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$
  $E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$
  Having calculated the above expectations, the new model parameter values are calculated.
  We are converging in a continuous-valued space, so a stopping criterion is needed.
  Stopping criterion: the average change in the log likelihood.
The Baum-Welch Algorithm
  Initialization: Pick arbitrary model parameters.
  Recurrence:
    Set all the A and E variables to their pseudocount values r (or to zero).
    For each sequence j = 1 ... n:
      Calculate $f_k(i)$ for sequence j using the forward algorithm.
      Calculate $b_k(i)$ for sequence j using the backward algorithm.
      Add the contribution of sequence j to A and E.
    Calculate the new model parameters.
    Calculate the new log likelihood of the model.
  Termination: Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.
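The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the slides' implementation: states and symbols are integer indices, a fixed initial distribution `init` replaces an explicit begin state, there is no end state, and the probabilities are left unscaled, so it is only suitable for short sequences. All function and parameter names are hypothetical.

```python
import math


def forward(x, init, a, e):
    """Unscaled forward table f[i][k] and the sequence probability P(x)."""
    K = len(init)
    f = [[init[k] * e[k][x[0]] for k in range(K)]]
    for i in range(1, len(x)):
        f.append([e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f, sum(f[-1])


def backward(x, a, e):
    """Unscaled backward table b[i][k]; the last row is all ones."""
    K = len(a)
    b = [[1.0] * K for _ in x]
    for i in range(len(x) - 2, -1, -1):
        for k in range(K):
            b[i][k] = sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
                          for l in range(K))
    return b


def baum_welch(seqs, init, a, e, n_symbols, pseudo=0.0, max_iter=100, tol=1e-6):
    """Re-estimate a and e; init is kept fixed (an assumption of this sketch)."""
    K = len(init)
    old_ll = -math.inf
    ll = old_ll
    for _ in range(max_iter):
        # Reset the expected counts to the pseudocount value (or zero).
        A = [[pseudo] * K for _ in range(K)]
        E = [[pseudo] * n_symbols for _ in range(K)]
        ll = 0.0
        for x in seqs:
            f, px = forward(x, init, a, e)
            b = backward(x, a, e)
            ll += math.log(px)
            for i in range(len(x)):
                for k in range(K):
                    # Expected use of the emission of x[i] from state k.
                    E[k][x[i]] += f[i][k] * b[i][k] / px
                    if i + 1 < len(x):
                        for l in range(K):
                            # Expected use of the transition k -> l.
                            A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                        * b[i + 1][l] / px)
        # New parameters from the normalized expected counts.
        a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]
        e = [[E[k][s] / sum(E[k]) for s in range(n_symbols)] for k in range(K)]
        # ll is the log likelihood under the parameters used for this pass.
        if ll - old_ll < tol:
            break
        old_ll = ll
    return a, e, ll
```

A call such as `baum_welch(seqs, [0.5, 0.5], a0, e0, n_symbols=2)` with two-state starting matrices `a0` and `e0` returns the re-estimated matrices. Different starting points can end in different local maxima, so in practice several restarts are usually compared.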
The Baum-Welch Algorithm (Cont'd)
  The pseudocount values r cannot be rigorously interpreted in terms of Dirichlet priors in this setting.
  The Baum-Welch algorithm is a special case of a very powerful general approach to probabilistic parameter estimation called the EM (expectation-maximization) algorithm.
Viterbi Training
  The most probable path for each training sequence is derived by the Viterbi algorithm.
  This information is used in the re-estimation process.
  Once the paths are given, the procedure is the same as training when the state path is known.
  The algorithm converges precisely, because the assignment of paths is a discrete process.
  Baum-Welch maximizes the true likelihood $P(x^1, \ldots, x^n \mid \theta)$.
  Viterbi training finds the value of $\theta$ that maximizes the contribution to the likelihood from the most probable paths for all the sequences: $P(x^1, \ldots, x^n \mid \theta, \pi^*(x^1), \ldots, \pi^*(x^n))$.
  In general Baum-Welch performs better than Viterbi training.
  Viterbi training can still be reasonable when the primary use of the HMM is Viterbi decoding.
Example: Casino, Part 5
  Training sequences: [figure: training sequences]
The Model Estimated by the Baum-Welch Algorithm
  The estimated probabilities are somewhat different from the true ones.
  This is partly due to the problem of local maxima.
  The limited amount of data does not permit accurate estimation of the parameter values.
The Model Learned from 30000 Rolls
  The log-odds per roll, relative to a fair die:
  $\frac{1}{300} \log \frac{P(x \mid \text{loaded die model})}{P(x \mid \text{fair die model})}$
  The correct model: 0.101 bits
  Model from 300 rolls: 0.097 bits
  Model from 30000 rolls: 0.097 bits
Estimation of the HMM Parameters for the Classification Problem
  We have to train the models of one class separately from the model of the other class and then combine them into a larger HMM.
  This separate estimation can be quite tedious when there are more than two classes.
  The estimation of the transitions is not a simple counting problem when the transitions between the submodels are ambiguous.
Modeling of Labeled Sequences
  The combined model of all the classes: each state is assigned a class label.
  The labels on the observations $x_1, \ldots, x_L$: $y = y_1, \ldots, y_L$.
  In the Baum-Welch algorithm only the valid paths are allowed, that is, paths whose state labels agree with $y$.
Discriminative Estimation
  Unless there are ambiguous transitions between submodels, the above estimation procedure gives the same result as if the submodels were estimated separately by the Baum-Welch algorithm and then combined with appropriate transitions afterwards.
  $\theta^{ML} = \arg\max_\theta P(x, y \mid \theta)$
  Our primary interest is in obtaining good predictions of $y$: the conditional maximum likelihood (CML)
  $\theta^{CML} = \arg\max_\theta P(y \mid x, \theta)$
The Maximum Mutual Information
  $P(y \mid x, \theta) = P(x, y \mid \theta) / P(x \mid \theta)$
  $P(x, y \mid \theta)$: the probability calculated by the forward algorithm for labeled sequences.
  $P(x \mid \theta)$: the probability calculated by the standard forward algorithm, disregarding all the labels.
  There is no EM algorithm for optimizing this likelihood.
Choice of Model Topology
  Fully connected models
    The severe problem of local maxima: rarely used in practice.
    The less constrained the model is, the more severe the local maximum problem becomes.
  There exist methods which attempt to adapt the model topology based on the training data.
  Successful HMMs are constructed based on knowledge about the problem under investigation.
    E.g., the model of CpG islands.
  To disable the transition from state $k$ to $l$, just set $a_{kl}$ to zero.
Duration Modeling
  The nucleotide distribution does not change for a certain length of DNA.
    The CpG islands model or the dishonest casino.
  The probability of staying in the state for $l$ residues:
  $P(l \text{ residues}) = (1 - p)\, p^{l-1}$ (geometric distribution)
  The exponential decay could be inappropriate in some applications.
  [Figure: a model with a minimum length of 5 residues and an exponentially decaying distribution over longer sequences]
Duration Modeling (Cont'd)
  [Figure: a model for any distribution of lengths between 2 and 10]
A More Subtle Modeling for Non-Geometric Length Distribution
  An array of $n$ states, each with self-transition probability $p$ and probability $1 - p$ of moving to the next state.
  For any path of length $l$ the transition probability is $p^{l-n} (1 - p)^n$.
  The probability of all the possible sequences of length $l$:
  $P(l) = \binom{l-1}{n-1} p^{l-n} (1 - p)^n$
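As a quick numerical check of this length distribution, the following sketch evaluates P(l) directly. The function name `length_prob` is hypothetical; with n = 1 it reduces to the geometric distribution of a single self-looping state.

```python
from math import comb


def length_prob(l, n, p):
    """P(l) for a chain of n states, each with self-transition probability p."""
    if l < n:
        return 0.0  # every one of the n states must be visited at least once
    # comb(l - 1, n - 1) counts the paths of length l through the n states.
    return comb(l - 1, n - 1) * p ** (l - n) * (1 - p) ** n


geometric = length_prob(3, n=1, p=0.5)    # same as (1 - p) * p ** 2
total = sum(length_prob(l, n=3, p=0.5) for l in range(1, 200))  # close to 1
```

Increasing n moves the mode of the distribution away from the shortest length, which is what makes this construction useful when a geometric decay is a poor fit.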
The Negative Binomial Distribution
  [Figure: negative binomial length distributions for p = 0.99 and n ≤ 5]
  For a continuous Markov process: the Erlang distribution and phase-type distributions.
  It is possible to model the length distribution explicitly.
Silent States
  The states that do not emit any symbol: silent or null states.
    Begin and end states.
  Example in Chapter 5: the Markov model where all states in a chain need to be connected to all states later in the chain.
    If the length of the chain is 200: about 20000 transitions.
    Too many to be reliably estimated from realistic data sets.
The Forward Connected Model
  In order to allow for arbitrary deletions, each state is connected to all states later in the chain.
  The number of parameters required for such a model consisting of 200 states is:
  $199 + 199 \times 198 / 2 \approx 20000$
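The size of a forward connected model can be checked by counting the state pairs directly. This is a small sketch with a hypothetical function name; it counts one transition for every ordered pair of chain positions (k, l) with k before l.

```python
def count_forward_transitions(n_states):
    """Number of transitions when every state connects to all later states."""
    return sum(1 for k in range(n_states) for l in range(k + 1, n_states))


n_transitions = count_forward_transitions(200)  # 200 * 199 / 2 = 19900
```

The count grows quadratically with the chain length, which is why such a model quickly has more parameters than realistic data sets can support.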
A Parallel Chain of Silent States
  A model with silent states.
  The number of parameters of the model corresponding to the above forward connected model is:
  $199 + 198 + 198 + 197 \approx 800$
  The reduction of the parameters also reduces the representation power.
    For example, the model cannot give 1→5 and 2→4 high probabilities while giving 1→4 and 2→5 low probabilities.
Extensions to the Forward Algorithm
  So long as there are no loops consisting entirely of silent states, it is easy to extend all the HMM algorithms to incorporate them.
  For the forward algorithm, $f_l(i+1)$:
    1. For all real states $l$, calculate $f_l(i+1)$ as before from $f_k(i)$ for states $k$.
    2. For any silent state $l$, set $f_l(i+1)$ to $\sum_k f_k(i+1)\, a_{kl}$ for real states $k$.
    3. Starting from the lowest numbered silent state $l$, add $\sum_k f_k(i+1)\, a_{kl}$ to $f_l(i+1)$ for all silent states $k < l$.
  If there are loops consisting entirely of silent states:
    Eliminate the silent states from the calculations (giving the fully connected model).
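High Order Markov Chains
  An nth order Markov process:
  $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1) = P(x_i \mid x_{i-1}, \ldots, x_{i-n})$
  An nth order Markov chain over some alphabet $A$ is equivalent to a first order Markov chain over the alphabet $A^n$ of n-tuples, because
  $P(x_k \mid x_{k-1}, \ldots, x_{k-n}) = P(x_k, x_{k-1}, \ldots, x_{k-n+1} \mid x_{k-1}, \ldots, x_{k-n})$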
A Second Order Markov Chain
  A second order Markov chain for sequences of A and B:
    ABBAB → AB-BB-BA-AB
  The equivalent first order Markov chain has the states AA, AB, BA, BB.
    BB or BA following BA or AA is disallowed.
  Sometimes the framework of the high order model is convenient.
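The translation into a first order chain over pairs can be sketched as follows. The function names are hypothetical; `to_tuple_sequence` rewrites a sequence as overlapping n-tuples and `is_allowed` checks whether one tuple state may follow another.

```python
def to_tuple_sequence(seq, n):
    """Rewrite seq as its overlapping n-tuples (the states of the new chain)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def is_allowed(prev, nxt):
    """A tuple may follow another only if they overlap in n - 1 symbols."""
    return prev[1:] == nxt[:-1]


pairs = to_tuple_sequence("ABBAB", 2)      # ['AB', 'BB', 'BA', 'AB']
blocked = is_allowed("BA", "BB")           # False: BA ends in A, BB starts with B
```

The overlap rule is what forces many transitions of the equivalent first order chain to have probability zero.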
Finding Prokaryotic Genes
  Genes of prokaryotes (bacteria): a very simple one-dimensional structure.
    Start codon : codons for amino acids : stop codon
Open Reading Frames
  Finding good gene candidates from DNA stretches:
    Start with one of the possible start codons.
    Continue with a number of non-stop codons.
    End with one of the possible stop codons.
  These candidates are open reading frames (ORFs).
  Overlapping ORFs: the same stop codon with different start codons.
  There are many more ORFs than real genes.
    The task is to discriminate between a non-coding ORF and a real gene.
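A simple ORF scan along these lines might look like the sketch below. It is an illustration under stated assumptions, not the procedure used for the slides' data: only the forward strand is scanned, ATG is the only start codon considered, and the names `find_orfs`, `STARTS` and `STOPS` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # assumption: other possible start codons are ignored


def find_orfs(dna, min_len=0):
    """Return (start, end) pairs; end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        open_starts = []  # start codons seen since the last stop in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STOPS:
                # Every open start codon shares this stop codon.
                orfs.extend((s, i + 3) for s in open_starts
                            if i + 3 - s >= min_len)
                open_starts = []
            elif codon in STARTS:
                open_starts.append(i)
    return sorted(orfs)


orfs = find_orfs("ATGAAAATGCCCTAA")  # [(0, 15), (6, 15)]
```

The two results in the example share one stop codon, which is the overlapping case mentioned above.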
Example: DNA from E. coli
  The data set: 1100 genes more than 100 nucleotides long.
    900 for training the model and 200 for testing the trained model.
  A first order model, just as for the CpG islands, was estimated from the training data.
  In the test set, 6500 ORFs with a length of more than 100 bases were found (non-coding ORFs, NORFs).
    ORFs that share the stop codon with known real genes were not included.
Histograms of the Log-Odds per Nucleotide
  The null model for the log-odds: the simplest model, with the probability for each nucleotide equal to the frequency with which it occurs in all the data.
  Average for genes: 0.018; average for NORFs: 0.009.
  Discrimination with this model is nearly impossible.
A Markov Chain Consisting of Codons
  All the sequences are transformed to sequences of codons.
  A 64-state first order Markov chain was estimated.
  The null model: a uniform distribution over codons.
Inhomogeneous Markov Chains
  Using the position information in the codon: three models, for positions 1, 2 and 3.
  $a^1_{x_1 x_2}\, a^2_{x_2 x_3}\, a^3_{x_3 x_4}\, a^1_{x_4 x_5}\, a^2_{x_5 x_6} \cdots$
  GENEMARK gene-finding program: http://opal.biology.gatech.edu/genemark/
    Inhomogeneous Markov chains are used.
  Extensions to the emission probabilities: the emission can be conditioned on preceding symbols, $e_k(b \mid b_1, \ldots, b_n)$.
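Scoring a sequence with three position-specific transition models can be sketched as below. The function name, the nested-dictionary layout and the `first_prob` distribution for the first symbol are assumptions of this sketch, and the sequence is assumed to start at codon position 1.

```python
import math


def inhomogeneous_log_prob(x, models, first_prob):
    """log P(x) where the transition x[i] -> x[i+1] uses model number i mod 3."""
    logp = math.log(first_prob[x[0]])
    for i in range(len(x) - 1):
        a = models[i % 3]  # cycles through codon positions 1, 2, 3, 1, 2, ...
        logp += math.log(a[x[i]][x[i + 1]])
    return logp


# Hypothetical uniform models, just to show the calling convention.
uniform = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
first = {s: 0.25 for s in "ACGT"}
score = inhomogeneous_log_prob("ATGAAA", [uniform] * 3, first)
```

In a real gene finder the three matrices would be estimated separately from the transitions observed at each codon position in the training genes.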
Numerical Stability of HMM Algorithms
  Many calculations in the Viterbi, forward and backward algorithms multiply many probabilities.
  Consider a model of genomic sequences with 100000 bases and a typical emission and transition probability of 0.1.
    The resulting probability will be $10^{-100000}$.
    Most computers cannot deal with such a small number.
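The Log Transformation
  $\log_{10} 10^{-100000} = -100000$
  For the Viterbi algorithm:
  $V_l(i+1) = \tilde{e}_l(x_{i+1}) + \max_k \left( V_k(i) + \tilde{a}_{kl} \right)$
  For the forward and backward algorithms: the logarithm of a sum of probabilities is needed, so exponentiation is required.
  With $\tilde{p} = \log p$ and $\tilde{q} = \log q$, the value $\tilde{r} = \log(p + q)$ is
  $\tilde{r} = \log\left( \exp(\tilde{p}) + \exp(\tilde{q}) \right) = \tilde{p} + \log\left( 1 + \exp(\tilde{q} - \tilde{p}) \right)$
  The function $\log(1 + \exp(x))$ can be approximated by interpolation from a table.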
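The log-space addition can be sketched as follows. Note that this version calls the standard library's `math.log1p` rather than the table interpolation mentioned on the slide, and the function name `log_add` is hypothetical.

```python
import math


def log_add(p_log, q_log):
    """Return log(p + q) given p_log = log p and q_log = log q."""
    if p_log < q_log:
        p_log, q_log = q_log, p_log   # make p_log the larger term
    if q_log == -math.inf:
        return p_log                  # adding a zero probability changes nothing
    # exp(q_log - p_log) is at most 1, so it cannot overflow.
    return p_log + math.log1p(math.exp(q_log - p_log))


tiny = log_add(-100000.0, -100000.0)  # -100000 + log 2, no underflow
```

Factoring out the larger term first is what keeps the exponent non-positive, so the sum stays representable even when the probabilities themselves are far below the floating-point range.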
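Scaling of Probabilities
  To rescale the f and b variables:
  $\tilde{f}_l(i) = \frac{f_l(i)}{\prod_{j=1}^{i} s_j}$
  $\tilde{f}_l(i+1) = \frac{1}{s_{i+1}}\, e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$
  Choosing $s_i$ such that $\sum_l \tilde{f}_l(i) = 1$ gives
  $s_{i+1} = \sum_l e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$
  $\tilde{b}_k(i) = \frac{1}{s_i} \sum_l a_{kl}\, \tilde{b}_l(i+1)\, e_l(x_{i+1})$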
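A scaled forward pass can be sketched as below. The assumptions of this sketch: states and symbols are integer indices, an initial distribution `init` replaces the begin state, there is no end state, and each scaling factor is chosen so that the scaled forward values at a position sum to one. The name `scaled_forward` is hypothetical.

```python
import math


def scaled_forward(x, init, a, e):
    """Return the scaled forward table and log P(x) as the sum of log s_i."""
    K = len(init)
    scaled, log_px = [], 0.0
    for i, symbol in enumerate(x):
        if i == 0:
            raw = [init[l] * e[l][symbol] for l in range(K)]
        else:
            prev = scaled[-1]
            raw = [e[l][symbol] * sum(prev[k] * a[k][l] for k in range(K))
                   for l in range(K)]
        s = sum(raw)                  # scaling factor s_i for this position
        log_px += math.log(s)
        scaled.append([v / s for v in raw])
    return scaled, log_px
```

Because every row of the scaled table sums to one, the values never underflow, and under these assumptions the log probability of the whole sequence is recovered from the scaling factors alone.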
Further Reading
  Basic introduction to HMMs: [Rabiner and Juang 1986; Krogh 1998]
  Early applications of HMM-like models to sequence analysis: [Borodovsky et al. 1986a, 1986b, 1986c]
  GENEMARK genefinder program: [Borodovsky and McIninch 1993]
  EM algorithm for modeling protein binding motifs: [Cardon and Stormo 1992]
  Combining neural networks and HMMs: [Stormo and Haussler 1996; Kulp et al. 1996; Reese et al. 1997; Burge and Karlin 1997]
  HMMs for modeling compositional differences between DNA from mitochondria and from the human X chromosome and bacteriophage lambda: [Churchill 1989]
  A three-state HMM for prediction of protein secondary structure: [Asai, Hayamizu and Handa 1993]
  An HMM with ten states in a ring for modeling an oscillatory pattern in nucleosomes: [Baldi et al. 1996]