Bioinformatics Lecture Notes

Announcements
Remember: Project proposals are due in April.

Class 22, April 4, 2002

A. Hidden Markov Models

1. Definitions

Example - Consider the example we talked about in class last time with the coins. However, this time, instead of 3 coins with a heads or tails outcome, assume that you have a 4-sided die with the letters A, C, T, G on the faces. You roll the die n times for a sequence of length n.

Example - This example can be extended to protein sequences if we consider a 20-sided die with a different amino acid on each face. Alternatively, we can consider the 64 codons.

To see how Hidden Markov Models can be used, consider the following example:

The occasionally dishonest casino example - In a casino, they use a fair die most of the time, but occasionally they switch to a loaded die. The loaded die has probability 0.5 of a 6 and 0.1 for the numbers 1 to 5. Assume that the casino switches from a fair to a loaded die with probability 0.05 before each roll and that the probability of switching back is 0.1. Then the switch between the dice is a Markov process. In each state the outcome of a roll has different probabilities. This is an example of a hidden Markov model.

In this example, if we just see the outcomes (emissions) of the die rolls, the state sequence is hidden. The emission probabilities $e_k(b)$ are the probabilities of each particular outcome $b$ when in state $k$. The transition probability $a_{kl}$ is the probability that the state will change from state $k$ to state $l$. These can be written as follows:

$a_{kl} = \Pr[\pi_i = l \mid \pi_{i-1} = k]$
$e_k(b) = \Pr[x_i = b \mid \pi_i = k]$

We can write down the joint probability of an observed sequence $x$ and a state sequence $\pi$:
$\Pr[x, \pi] = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$   (*)

Here $L$ is the length of the sequence and state 0 denotes the begin and end state, so that $\pi_{L+1} = 0$, as in Durbin et al.

2. The Viterbi Algorithm: Computing the most probable state path

Looking at the emitted sequence, we need a way to determine the most likely sequence of states that yields the data we are considering, i.e.

$\pi^* = \arg\max_{\pi} \Pr[x, \pi]$

The most probable path $\pi^*$ can be found recursively. Let the probability of the most probable path ending in state $k$ with observation $i$, $v_k(i)$, be known for all the states $k$. These probabilities can be calculated for observation $i+1$ by

$v_l(i+1) = e_l(x_{i+1}) \max_k ( v_k(i) \, a_{kl} )$

with the initial condition $v_0(0) = 1$ (beginning state). After running through the iteration for all observations in the sequence, the actual state sequence can be found by using the traceback (see the algorithm in Durbin et al. or Clote and Backofen).

On a more practical note, when we take the product of many probabilities the numbers often get really small. This can cause underflow errors on a computer. To remedy this, use $\log v_l(i)$ instead.

Figure 3.5 (in Durbin et al.) shows data generated in a simulation, the actual die used and the prediction by the Viterbi algorithm. Notice that while it is not exactly correct, it is close.

3. Probability of a sequence in a Hidden Markov Model

Earlier in the semester, we showed how to calculate the probability of a sequence derived from a Markov chain:

$\Pr[x] = \Pr[x_1] \prod_{i=2}^{L} a_{x_{i-1} x_i}$, where $a_{x_{i-1} x_i} = \Pr[x_i \mid x_{i-1}]$.

We want to be able to do the same for the Hidden Markov Model. This is slightly more complicated because many different state paths can give rise to the same sequence. This requires that we sum the probabilities over all paths to obtain the full probability of a sequence:

$\Pr[x] = \sum_{\pi} \Pr[x, \pi]$

Because the number of paths increases exponentially with the length of the sequence, enumerating all paths is too computationally expensive. Instead,
we can use equation (*) above for the most probable path $\pi^*$ as an approximation for $\Pr[x]$. Alternatively, we can use a dynamic programming method similar to the Viterbi algorithm, called the forward algorithm. The term analogous to the Viterbi variable $v_k(i)$ in the forward algorithm is

$f_k(i) = \Pr[x_1 \ldots x_i, \pi_i = k]$,

the probability of the observed sequence up to and including $x_i$, with the requirement that $\pi_i = k$. The recursion equation is

$f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i) \, a_{kl}$

(see the algorithm in Durbin et al. or Clote and Backofen). Similar to the Viterbi algorithm, one might work in log space to avoid underflow errors.

4. The backward algorithm and posterior state probabilities

Sometimes we might want to know the most probable state for an observation $x_i$. This can be rephrased as the probability that observation $x_i$ came from state $k$ given the observed sequence, $\Pr[\pi_i = k \mid x]$. This is called the posterior probability. To compute it we use the backward algorithm, which is a bit indirect. First, we write the probability of producing the entire observed sequence with the $i$th symbol being produced by state $k$, i.e.

$\Pr[x, \pi_i = k] = \Pr[x_1 \ldots x_i, \pi_i = k] \, \Pr[x_{i+1} \ldots x_L \mid x_1 \ldots x_i, \pi_i = k]$
$= \Pr[x_1 \ldots x_i, \pi_i = k] \, \Pr[x_{i+1} \ldots x_L \mid \pi_i = k]$   (**)

because everything after position $i$ only depends on the state at $i$. The first term is the $f_k(i)$ from the forward algorithm. The second term is

$b_k(i) = \Pr[x_{i+1} \ldots x_L \mid \pi_i = k]$,

which is obtained by a backward recursion starting at the end of the sequence (see the backward algorithm in Durbin et al. or Clote and Backofen). Once we have computed $f_k(i)$ from the forward algorithm, $b_k(i)$ from the backward algorithm and $\Pr[x]$ from either, we rewrite equation (**) as

$\Pr[x, \pi_i = k] = f_k(i) \, b_k(i)$

then we can calculate
$\Pr[\pi_i = k \mid x] = \frac{\Pr[x, \pi_i = k]}{\Pr[x]} = \frac{f_k(i) \, b_k(i)}{\Pr[x]}$

by the definition of conditional probability (see Fig. 3.6).

5. Posterior decoding

When many different paths have close to the same probability as the most probable one, it is not reasonable to choose only the most probable one. One approach is to define a second state sequence

$\hat{\pi}_i = \arg\max_k \Pr[\pi_i = k \mid x]$

This state sequence might be more appropriate when we are interested in the state assignment at a particular point $i$, rather than the complete path. In Figure 3.7 (Durbin et al.), the dishonest casino example is used with the probability of switching from a fair to a loaded die set to 0.01. In this case the Viterbi algorithm never visits the loaded die state. However, if we look at the posterior probabilities, it is clear where the loaded die is used.

B. Parameter Estimation for Hidden Markov Models

When we are determining a Hidden Markov Model, the first step is to define the model in the first place. In our example, we know the form of the model. Sometimes we do not know this. There are two parts to designing a model: 1) defining the model structure or topology and 2) assigning the parameter values for the emission and transition probabilities. To do this we need a set of example sequences, called the training set, which we will call $x^1, \ldots, x^n$. These are assumed to be independent, so the joint probability of all the sequences is simply the product of the probabilities of the individual sequences. Since this is a product, it makes sense to work in log space.

1. Estimating when the state sequence is known

When the paths are all known, we can count the number of times that each particular transition or emission is used in the training set. We can call these counts $A_{kl}$ and $E_k(b)$, respectively. We can compute the maximum likelihood estimators for $a_{kl}$ and $e_k(b)$ by

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$   (***)
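The counting estimators in (***) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate_from_labeled`, the state labels F (fair) and L (loaded), and the tiny training set are all made up for the example and do not come from the lecture or from Durbin et al.

```python
from collections import Counter, defaultdict

def estimate_from_labeled(sequences, paths):
    """Maximum likelihood estimates of a_kl and e_k(b) from known state paths."""
    A = defaultdict(Counter)  # A[k][l]: number of k -> l transitions observed
    E = defaultdict(Counter)  # E[k][b]: number of times state k emitted symbol b
    for x, pi in zip(sequences, paths):
        for i, (symbol, state) in enumerate(zip(x, pi)):
            E[state][symbol] += 1
            if i + 1 < len(pi):
                A[state][pi[i + 1]] += 1
    # Normalize each row of counts, as in equation (***).
    a = {k: {l: n / sum(row.values()) for l, n in row.items()} for k, row in A.items()}
    e = {k: {b: n / sum(row.values()) for b, n in row.items()} for k, row in E.items()}
    return a, e

# Hypothetical training set: rolls with their (known) fair/loaded labels.
rolls = ["1626", "66"]
states = ["FLLF", "LL"]
a, e = estimate_from_labeled(rolls, states)
print(a["L"], e["L"])
```

With so few observations, most transitions and emissions are never seen and receive no estimate at all, which is the kind of underdetermined problem that a larger training set is meant to avoid.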
In order for this to work, we have to have enough data in the training set so that our problem is not underdetermined.

2. Baum-Welch and Viterbi training: Estimation when the paths are unknown

Definition: The Baum-Welch score is the likelihood $L_O(M)$ that model $M$ generates $O$, defined by

$L_O(M) = \Pr[O \mid M] = \sum_{\pi} \Pr[O, \pi \mid M] = \sum_{\pi} \Pr[O \mid \pi, M] \, \Pr[\pi \mid M]$

Definition: The Viterbi score is the same as the Baum-Welch score except with the maximum in place of the sum over all paths of states.

a) Baum-Welch Method
This is a very common method. It estimates $A_{kl}$ and $E_k(b)$ by considering the probable paths for the training sequences using the current values of $a_{kl}$ and $e_k(b)$. Then Eq. (***) is used to derive new values of the $a$'s and $e$'s. Clote and Backofen show the proof that the overall log likelihood increases with each iteration and that the process converges to a local maximum. One must realize that, since it is a local maximum, the local maximum you end up in is strongly dependent on the initial values of the parameters.

b) Expectation Maximization
c) Baldi-Chauvin Gradient Descent

C. Applications
a) Multiple sequence alignment
b) Protein motifs
c) Eukaryotic DNA promoter regions
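Returning to the Baum-Welch method of section B.2, one re-estimation step for the dishonest casino might be sketched as follows. It combines the forward and backward recursions of sections A.3 and A.4 to get expected counts, then normalizes them as in Eq. (***). Several things here are assumptions made for illustration and are not taken from the lecture: the function names, the made-up roll sequence, the uniform start distribution (held fixed rather than re-estimated), the absence of an explicit end state, and the use of plain probabilities instead of log space, which is only safe for short sequences.

```python
def forward(x, a, e, start):
    """f[i][k] = Pr[x_1..x_i, pi_i = k]; also returns Pr[x] (no end state assumed)."""
    K = len(a)
    f = [[start[k] * e[k][x[0]] for k in range(K)]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([e[l][sym] * sum(prev[k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f, sum(f[-1])

def backward(x, a, e):
    """b[i][k] = Pr[x_{i+1}..x_L | pi_i = k], with b = 1 at the last position."""
    K = len(a)
    b = [[1.0] * K]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(a[k][l] * e[l][sym] * nxt[l] for l in range(K))
                     for k in range(K)])
    return b

def baum_welch_step(x, a, e, start):
    """One re-estimation of a and e; returns them with Pr[x] under the OLD values."""
    K, n = len(a), len(x)
    f, px = forward(x, a, e, start)
    b = backward(x, a, e)
    # Expected transition counts A_kl.
    A = [[sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
              for i in range(n - 1)) / px
          for l in range(K)] for k in range(K)]
    # Expected emission counts E_k(s), from the posteriors f_k(i) b_k(i) / Pr[x].
    E = [{s: sum(f[i][k] * b[i][k] for i in range(n) if x[i] == s) / px
          for s in e[k]} for k in range(K)]
    new_a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]
    new_e = [{s: E[k][s] / sum(E[k].values()) for s in E[k]} for k in range(K)]
    return new_a, new_e, px

# State 0 = fair die, state 1 = loaded die; initial guesses use the casino values.
rolls = "314256136666266661662345123416"  # made-up rolls, for illustration only
start = [0.5, 0.5]                         # assumed start distribution
a0 = [[0.95, 0.05], [0.10, 0.90]]
e0 = [{s: 1 / 6 for s in "123456"},
      {**{s: 0.1 for s in "12345"}, "6": 0.5}]

a1, e1, p0 = baum_welch_step(rolls, a0, e0, start)
_, _, p1 = baum_welch_step(rolls, a1, e1, start)
print(p0, p1)  # likelihood of the rolls before and after one update
```

Comparing `p0` and `p1` gives a quick check on the convergence claim: the likelihood after the update should not be lower than before. Repeating the step until the change is negligible gives a local maximum that depends on `a0` and `e0`.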