A marginal mixture model for discovering motifs in sequences

Size: px

Start display at page:

Download "A marginal mixture model for discovering motifs in sequences"

Augusta Manning
6 years ago
Views:

1 A margna mxture mode for dscoverng motfs n sequences E Voudgar and Konstantnos Bekas Abstract. In ths study we present a margna mxture mode for dscoverng probabstc motfs n categorca sequences. The proposed method s based on a genera framework for deveopng, extendng and margnazng expressve motf modes that encapsuates spata nformaton. Two aternatve schemes for constructng expressve modes are descrbed, the extend and the margna approach. The EM agorthm s apped for estmatng the mode parameters, whe an ntazaton procedure s used based on the known K-means agorthm. Numerca experments on varous smuated and rea sets of sequences demonstrate the advantages of the proposed approach n comparson wth the basc maxmum kehood wth the EM agorthm scheme and the Gbbs sampng approach. Introducton Dscoverng motfs n sequences s an mportant and attractve pattern recognton probem found n severa appcaton areas, such as bonformatcs, web mnng, etc. Gven a set of dscrete sequences, a probabstc motf (or pattern) can be seen as a common substrng that s nosy repeated n dfferent ocatons n sequences. The motfs are hghy conserved resdues present n actve stes of sequences and thus they have powerfu dscrmnatve abtes n the sense of creatng features n cassfcaton tasks [4, ]. In the terature there are varous methods that have been ntroduced for sovng ths probem, cassfed accordng to the type of motfs they produce. The Gbbs sampng [9, ], the MEME [], the SAM [8], the BoProspector [], the Greedy EM [3] and the LOGOS [3], consttute representatve statstca approaches for dscoverng shared motfs n a set of sequences. Most of them use probabstc generatve modes to mode motfs as stochastc strng patterns randomy embedded n a smpe background. In such a settng, the motf dscovery probem can be seen as a standard mssng-vaue nference and beng converted nto a parameter estmaton probem. In partcuar, a methods formuate the probem usng ether mxture of mutnoma modes or hdden Markov modes, and appy standard methods such as the Expectaton-Maxmzaton (EM) agorthm [6, 2] or Gbbs sampng[9, ] schemes to estmate the motf mode parameters. The most common way for stochastcay representng a motf s through the poston weght matrx (PWM) that gves the dstrbuton of aphabet characters at every poston assumng that the postons wthn a motf are ndependent. Nevertheess, the smpe PWM mode descrpton s senstve to nose and random or trva recurrent patterns (repettons of short substrngs), and s unabe to capture potenta ste dependences nsde the motf [3]. Varous methods Department of Computer Scence, Unversty of Ioannna, 4 Ioannna, Greece, ema: {evoudga, kbekas}@cs.uo.gr have been deveoped to ncorporate spata nformaton n motfs but, mosty, they are motf specfc and hande ony speca shape motfs [3]. Our am n ths study s to deveoore compact expressve motf modes that encoses more naturay the mportant spata nformaton of motf postons. Ths s done by ntroducng a margna mxture mode under two schemes that encompass the motf neghborhood. In the frst approach (the extended mode), the PWM s expanded for capturng the entre neghborhood of a motf and not a snge K-mer. Aternatvey (the margna mode), the PWM not ony can ft the motf occurrences, but aso can partay ft the substrngs that overap wth them and beong to the same motf neghborhood. These new mode capabtes strengthen the expresson eve of a motf mode by ndrecty ntroducng usefu spata operators. Fnay, another advantage s that the proposed approach s ess senstve to the ntazaton that makes the EM agorthm apped for estmatng mode parameters to be ess affected on parameter ntazaton. We have made experments wth smuated sets of sequences as we as known boogca benchmarks. Comparsons have been aso made wth rva methods: the basc maxmum kehood approach wth the EM agorthm [] and the Gbbs sampng [9, ], where we set up three evauaton crtera. In secton 2 we present the basc methodoogy for dscoverng probabstc motfs wth mxture modes and the EM agorthm for estmatng mode parameters, and then the proposed margna mxture mode and ts expressve motf modes. eca care s aso gven to the ntazaton where we present a scheme based on the k- means custerng agorthm over the mode parametrc space. Secton 3 presents expermenta resuts obtaned by our method and comparatve approaches n varous datasets. Fnay, we summarze and gve some concuded remarks n secton 4. 2 Appyng mxture modes for motfs dscoverng Consder a fnte set Σ {c,..., c M } consstng of M ndvdua characters. An arbtrary strng over the set Σ s any sequence S j {s jk } L j k of ength Lj, where s jk Σ denotes the character at the k-th poston of the j-th sequence. Now, et S {S,..., S N } be a set of N strngs of ength L,..., L N, respectvey. A common subsequence of ength K that s repeated at dfferent stes among the nput sequences of set S s caed motf. In our study we w consdered that the motf ength K s known. The appcaton of mxture modes needs a necessary preprocessng step to be made frst, where we coect a the possbe substrngs over S of ength equa to K. Ths can be done by sdng a wndow of sze K to every sequence S j, obtanng a set of L j K + substrngs. Each substrng ndcates the startng poston

2 of a possbe motf copy over sequences. In such way, we obtan a set of n substrngs X {x } n, n N (Lj K + ), that consttutes the set of observatons n our probem. 2. The basc approach In the basc approach, we ft a two-component mxture of mutnomas mode to the set of nput substrngs X. Here, we assume that each substrng x has been generated from ether a motf mode (y ) or a background mode (y ), as gven by the hdden (bnary) varabe y. The motf mode component has a pror probabty P (y ) π, whe the second component, the background, captures a the non-motf nformaton wth a pror probabty P (y ) π. The densty functon of the mxture mode for an observaton x s gven by f(x Θ) π(x θ) + ( π)p b (x b), () where Θ {π, θ, b} s the set of the (unknown) mode parameters: the pror probabty ( π ) and the two component densty parameters. A fexbe and most commony used way for descrbng a motf mode s through a poston weght matrx (PWM) θ [θ k ] of sze K M. Each eement θ k denotes the probabty of character c Σ to be found n the k-th poston of the motf, where for each poston k t hods θ k. On the other hand, the background dstrbuton s represented wth an M-vector of probabtes b [b,..., b M ] that s common for any substrng poston ( b ). Foowng the mutnoma dstrbuton and assumng ndependence among motf postons, the probabty densty functons of the motf (x θ) and the background mode p b (x b) are K (x θ) p(x y, θ) θ δ k k, (2) p b (x b) p(x y, b) k K δ k k b, (3) where δ k s the Kronecker deta,.e. f character c s found at the k-th poston of substrng x, otherwse. Based on the above formuaton, the motf dscovery probem can now be converted nto a maxmum kehood (ML) estmaton probem, where the og-kehood functon s gven by L(Θ) og p(x Θ) og f(x Θ). (4) The EM agorthm [6] consttutes an effcent framework for estmatng the mode parameters Θ { π, {θ k } and {b } }. It teratvey performs two man steps. At frst the E-step where the condtona expectaton z p(y x, Θ (t) ) of the hdden varabes y s computed p(y x, Θ (t) ) π (t) (x θ (t) ) π (t) (x θ (t) ) + ( π (t) )p b (x b (t) ), () and aso the expectaton of the compete data og-kehood s estabshed: Q(Θ, Θ (t) ) {og π + og (x θ)} + +( ){og( π) + og p b (x b)}.(6) The above Q-functon s maxmzed next durng the M-step over the mxture mode parameters. Ths gves the foowng update equatons: π (t+) θ (t+) k b (t+) n, (7) δ k, (8) M δ k ( ) K K k ( ) δ k. (9) The EM agorthm guarantees the convergence of the kehood functon to a oca maxmum where smutaneousy satsfes a the constrants of the parameters. At the end of the process, the occurrences of the estmated motf can be found by seectng the substrngs x whose posteror probabty vaue z s above a threshod (e.g..8). 2.2 The proposed approach Nevertheess, a sgnfcant drawback from the above basc approach appears due to the convenent..d. assumpton of the observatons x. Ths prevents nto takng nto account the vauabe spata nformaton of data. As a resut substrngs that are found n the neghborhood of a motf occurrence may be consdered as motf copes.e. to have hgh posteror probabty vaues z to be a motf. Ths phenomenon, whch s mosty common n recurrent type of motfs where a short part of the motf s repeated, may be substantay modfy the expressve motf mutnoma mode to a non-nformatve unform mode. Suppose, for exampe, that the motf we want to dscover s the T AT AT AT AT A, where the 2-gram T A s repeated tmes (K ). Suppose aso a copy of ths motf x T AT AT AT AT A. Then, as shown n Tabe, there are substrngs n ts neghborhood (whch overap wth the x ) that w contan a sgnfcant part of the motf. Ths w make ther posteror probabty vaues (Eq. ) to be sgnfcant arge, and thus ther contrbuton to the update rue of motf mode parameters θ k of Eq. 8 w be hgh. Ths w cause nto a step by step dramatc reducton of the hgher probabtes vaues n every motf poston (that correspond to the characters of the motf) durng the EM procedure. Under ths crcumstance, the EM agorthm w end up to a unform dstrbuton of M characters n K motf postons and thus the motf dscovery w be faed. In the terature there are varous methods that try to overcome ths occurrence by ndrecty or drecty adoptng spata nformaton to the mode. In [] for exampe, a normazaton of the posteror vaue z of the adjacent sequences s performed so that guarantees n any wndow of ength K the sum of z vaues remans ess than or equa to. Another approach was presented n [2] by ntroducng spata constrants to the mode through the use of a Markov Random Fed (MRF) pror over the motf abes. In ths study we work more systematcay over spataty by ncorporatng ths usefu knd 2

3 Tabe. Substrngs n the neghborhood of a motf occurrence x. ce they a have a sgnfcant overap wth the rea motf and thus hgh posteror vaues z, they w contrbute consderaby to the EM updated rue of the motf parameters. x 4 x 2 x x +2 x +4 ****TATATA **TATATATA TATATATATA TATATATA** TATATA**** of nformaton from observatons more naturay and drecty to the expressve motf mode The extended mode In the frst scheme, the orgna poston weght matrx θ s extended by K more nes to both sdes, so as to ncude the entre neghborhood of a motf. The motf mode s now descrbed wth a new matrx of dmenson (3K 2) M and the fttng procedure can now be seen as searchng over the new matrx mode θ to fnd the best suted bock matrx of K sze (number of nes). Therefore, ths matrx expanson suggests 2K (overappng) mutnoma dstrbutons equa to the number of a possbe startng postons wthn motf neghborhood (frst 2K nes of matrx θ), wth the foowng densty functon ( j,..., 2K ) φ j (x θ) p(x y j, θ) K θ δ k j+k,. () k The hdden varabe y now defnes the startng poston of the substrng x wthn the neghborhood of the motf. Note that when there s not any such poston (y 2K), the substrng s generated by the smpest background mode as prevous. The mxture mode densty functon now becomes f(x Θ) 2K 2K π jφ j(x θ) + ( π j)p b (x b). () In the exampe of Tabe the ntroducton of the new matrx θ aow the neghborng substrngs (x 4, x 2, x +2, x +4 ) that are overappng wth the rea motf copy (x ) to be ftted wth one of these mutnoma dstrbuton and not destroy the dstrbuton of the orgna motf. We w refer to ths scheme as the extended mode. Lkewse, the EM agorthm can now be apped for estmatng the mode parameters. It requres the cacuaton of the posteror probabtes of hdden varabes durng the E-step j P (y j x, Θ (t) ) π(t) j φ j(x θ (t) ), (2) f(x Θ (t) ) and the maxmzaton of the correspondng Q-functon durng the M-step Q(Θ, Θ (t) ) 2K +( { 2K j {og π j + og φ j (x θ)} + 2K j ){og( π j ) + og p b (x b)}}. (3) Ths gves the foowng update rues θ (t+) k k π j n z(t) j n j δ,k j+, k k jk K+ j jk K+ 2K j δ,k j+, k jk K+ 2K b (t+) jk K+ j j δ,k j+, j, (4), f k < K, f K k < 2K n ( 2K j ) K k δ k, f 2K k < 3K () K n ( 2K j ). (6) As t s cear n the update equaton for the motf extended matrx vaues θ k (Eq. ), every ne contrbutes more than one tme (except for the borders) to the computaton of the posteror probabtes z j), snce they beong to the 2K mutnomas severa tmes. Moreover, t must be noted that n practce t does not aways happen the rea motf to be found n the centra bock matrx (between the K and 2K nes) of the extended matrx θ. After convergence of the EM agorthm, the fna motf w be found by searchng for the K contnuous nes of the matrx that have the greater sum of ther maxmum probabty vaues θ k. Fnay, there are two other mportant advantages of the above scheme that must be dscussed. At frst t s ess senstvty to the ntazaton of the motf parameters. Even f we have not perfecty ntazed the motf and a part of t s mssng, the matrx extenson w fnay manage to fx t and ncorporate the mssng part to ts neghborng nes. Aso, the extended mode s not very senstve to the ength of motf K, snce the new expressve matrx has a suffcent space for motfs of overestmated or underestmated ength The margna mode In the second approach, the poston weght matrx θ remans the same as the basc approach (sze K M), but now t has a dfferent behavor. Here, we consder that n ths mode ony a part of t s used for fttng nput substrngs x, whe the remanng part s treated as background. Ths can be seen as appyng a shftng operator to the examned substrng x across the matrx θ so as to fnd the optma agnment between them n terms of fttng. There are two cases. Ether the frst j postons of x are modeed by the background and the rest K j postons are modeed from the frst j nes of the matrx θ, or the opposte,.e. the ast j nes of θ are responsbe for modeng the frst j motf postons whe the rest are consdered as background. In the negatve case, the substrng s consdered entrey as background nformaton (Eq. 3). Therefore, we have agan 2K, 3

4 mutnoma dstrbutons, equa to the number of a possbe combnatons of fttng the observaton wth the margna mode. Every densty functon s now gven by K j φ j (x θ) b 2K j k k δ k j k θ δ k j K+k, θ δ,k+k j, k b K k2k j+ δ k, j K,K + j 2K (7) In ths case the hdden varabe y defnes the part (number of ast or frst postons) of any observaton x that beongs to the motf mode. Ths scheme w be referred next as margna mode. The appcaton of the EM agorthm for estmatng the parameters of the margna mode eads to dfferent update equatons at the M- step, n comparson wth the extended mode. These are: θ (t+) k b (t+) K+k jk K { j δ,k j+k, K+k jk K j j j δ k + k 2K jk+ K 2K { (K j) j + j jk+ K k2k j+ (j K) j } (8) δ k } (9) At the end of the EM agorthm, the motf can be found by searchng for the frst (k ) and the ast (k 2 ) poston (ne of the matrx θ) among the totay K postons, where ther maxmum probabty vaue are above a threshod (e.g..8). Ths segment [k k 2] corresponds ony to a part of the true motf. Its rest part can be substantay obtaned by frst fndng the substrngs x wth hgh posteror probabty vaues (z k2 >.8) that capture the segment [k k 2 ], and then by the suffcent statstcs of ther neghborng substrngs n ether drecton. 2.3 Intazaton of motf modes The major drawback of the EM agorthm s ts great dependence on the nta vaues of the mode parameters that may cause nto gettng stuck n oca maxma of the kehood functon [2]. In our study, ths probem s many concentrated on the ntazaton of the motf mode parameters θ, snce the background densty b s aways ntazed by the reatve frequences of characters n sequences. Varous approaches have been proposed n the terature to overcome ths probem. In [] for exampe, a dynamc programmng approach s used whch estmates the goodness of many possbe startng ponts based on the kehood of the mode after one EM teraton. Another method presented n [3] appes a dvsve herarchca custerng approach that generates a motf modes parametrc search space. In ths study we have apped an ntazaton strategy that s based on a custerng scheme usng the cassca k-means agorthm. In the genera case the k-means agorthm ams at fndng a partton of M dsjont custers to a set of n observatons, so as to mnmze the overa sum of ther dstances wth the custer centers µ j. 4 Dependng on the type of observatons, one must determne an approprate dstance functon and aso a method for cacuatng custer centers. In our study, we ntay map every nput substrng x nto a poston weght matrx ϑ, where ts vaues were taken from the Kronecker deta vaues ϑ k δ k. In ths mode parametrc search space we perform then a custerng procedure by usng the Manhattan dstance functon between two subsequences x, x j: d(, j) K k M ϑ k µ jk ϑ µ j. Fnay, the custer centers µ j {µ jk } are estmated by the suffcent statstcs of the substrngs that currenty beong to jth custer (reatve frequences of characters c n every poston k). The number of custers for searchng was set to k n/n. When fnshng, we ntaze the motf mode wth the center µ j from the custer j that has the mnmum ntracuster dstance,.e. average dstance between a custer members and ts center. The expermenta study has shown that ths custerng scheme provdes satsfactory nta vaues of mode parameters θ very fast. 3 Expermenta resuts Severa experments were performed usng both artfca and rea datasets n an attempt to study the effectveness of the proposed approach. Durng a experments the custerng scheme of k-means was frst executed twenty (2) tmes and the optmum souton was kept to ntaze the motf mode. The pror probabtes π j were a ntay set to π /(2K). We have tested both versons of the proposed margna mxture mode, the extend () and the margna (). Comparatve resuts have been aso obtaned usng the basc ML approach (basc), as we as the Gbbs sampng () [9, ]. The method teratvey performs two steps unt kehood convergence: It randomy seects frst a sequence S and re-estmates the motf mode θ (t+) usng the current motf postons of a sequences but S. Then a new startng poston of the motf mode s seected n S (among the L K + possbe) by sampng from the posteror dstrbuton. Obvousy, ths verson of the assumes that each sequence has a unque occurrence of the pattern, whe our approach suggests an arbtrary number of motf copes n each sequence. Athough ths s not far, n our study we have seected datasets havng a snge motf copy n every sequence. Fnay, t must be mentoned that a methods were ntazed wth the same way, except from the method whch s better to randomy ntazed. The ast was executed 2 successve tmes wth dfferent seed vaue and the best souton found was kept. The generaton mechansm of the artfca sets used n our experments was the foowng: Usng a dscrete aphabet Σ wth M characters, we unformy produce a number N 2 sequences of varabe ength (wth a mean ength L ). Then, n each sequence we randomy seect a poston for pacng a nosy copy of a preseected seed motf of ength K, accordng to a mutaton probabty vaue (common to every motf poston). Eght dfferent vaues for the nose parameter were used, and for each vaue we generated dfferent sets of artfca sequences and kept statstcs (mean vaues and stds) of the performance of a the comparatve methods. We have used the foowng three performance crtera for evauatng each method: θ: Manhattan dstance between the estmated and the true motf

5 mode (dstance dstrbutons) θ θ θ true K M k θ k θ true k, 2 2 where true pattern densty θ true was estmated from the reatve frequences of characters of the nosy motf copes. : percentage of the predcted postons of a true motf copes n the set of nput sequences X. : percentage of the rea copes of a motf found havng detected at east 33% of ther postons. In must be noted that for the cacuaton of and quanttes, when a method found overappng substrngs n a motf copy neghborhood, ony ther mean vaue of the predcted number of postons woud have been kept for that copy. We have made experments n a varety of smuated sets of sequences (varous seed motfs, ength K, aphabet sze M and nose parameter ). Fgure ustrates the depcted resuts we obtaned wth four seed motfs and an aphabet of sze M 8. Each dagram represents the mean vaue and the standard devaton of the evauaton measurements ( θ,, ) as cacuated by each one of the four comparatve approaches usng dfferent nosy eves (). Both methods, extend and margna, showed a better performance n comparson wth the other two approaches. In amost a cases both schemes of our method had managed to estmate better the true mode accordng to the θ crteron. And ths s of great nterest. Aso, as t was expected, n the case of recurrent motfs (Fg. (c), (d)) the basc approach competey fas to dscover them. On the other hand, our approach yeded satsfactory resuts and comparabe to these of the method. The resuts were the same wth severa other experments wth smuated data created from aphabet of dfferent sze. We have aso evauated our method n a rea dataset generated by the motf nformaton of Eschercha co ReguonDB [7]. Ths database conssts of varous sets of DNA sequences 2 havng a snge motf copy per sequence wth symmetrc margns (background) on both ste of t. The study of these benchmarks s very nterestng and attractve, snce the ground truth (ocatons of motf occurrences) s known a pror and aso the degree of smarty of motf occurrences s very ow (extremey nosy motfs). More detas about these data can be found n [7]. In our experments we have used a part of eght (8) such data sets presented n Tabe 2, where we have consdered two margn szes: 2 and (a) seed motf:p AMEP ARAKAT W, wth no repettons, K 2, M (b) seed motf:p AMEP ARAKAT W P AMET W RA wth no repettons, K 2, M (c) seed motf:t AT AT AT AT AT A, wth repettons, K 2, M 8 Tabe 2. The descrpton of the rea data used n our expermenta study Fename Motf ength (K) Number of sequences (N) DnaA 9 7 FruR 4 Fur 9 2 LexA 2 CytR 4 PurR 76 SoxS 78 7 TyrR Fgure 2 summarzes the resuts obtaned by the four comparng methods. In partcuar, t ustrates the mean vaues of both measurements and, as cacuated after 2 dfferent runs n any of the 2 Data can be downoaded from (d) seed motf:t AT AT AT AT AT AT AT AT AT A, wth repettons, K 2, M 8 Fgure. Comparatve resuts wth smuated data. For each probem we present three dagrams wth the cacuated mean vaues and stds of the three performance measurements.

6 (a) 2 margn sze (b) margn sze Fgure 2. Comparatve resuts on the rea datasets of Tabe 2 n the case of 2 (a) and (b) margn sze. The vertca bars represent the mean vaues of the two cacuated measurements, after executons of each method. seected data sets. Athough these knd of motfs are not recurrent, on average, we obtaned better resuts wth our approach n comparson wth the basc ML method. Ths renforce the proposed margna mxture mode snce t manages to detect quatatvey better motfs n extremey nose parameter. Fnay, athough the comparson wth the verson of the method n unfar, our method yeded comparabe resuts especay n motfs of sma ength K. An expanaton of ths s that when searchng for motfs of arge ength, for exampe K 76, the seected margn sze of 2 or s not enough for capturng the neghborhood of motfs wthn the expressve mode and thus the proposed approaches cannot be normay apped. 83, (99). [2] K. Bekas, A mxture mode based Markov random feds for dscoverng probabstc patterns n sequences, n Panheenc Conference n Artfca Integence (SETN-26) (Lecture Notes n Artfca Integence), voume 39, pp. 2 34, (26). [3] K. Bekas, D.I. Fotads, and A. Lkas, Greedy mxture earnng for mutpe motf dscoverng n boogca sequences, Bonformatcs, 9(), 67 67, (23). [4] A. Brāzma, I. Jonasses, I. Edhammer, and D. Gbert, Approaches to the automatc dscovery of patterns n bosequences, Journa of Computatona Boogy, (2), , (998). [] B. Bréjova, C. DMarco, T. Vna r, S.R. Hdago, G. Hogun, and C. Patten. Fndng patterns n boogca sequences. Project Report for CS798g, Unversty of Wateroo, 2. [6] A.P. Dempster, N.M. Lard, and D.B. Rubn, Maxmum kehood from ncompete data va the EM agorthm, J. Roy. Statst. Soc. B, 39, 38, (977). [7] J. Hu, B. L, and D. Khara, Lmtatons and potentas of current motf dscovery agorthms, Nucec Acds Research, 33(), , (2). [8] R. Hughey and A. Krogh, Hdden Markov modes for sequence anayss: enson and anayss of the basc method, CABIOS, 2(2), 9 7, (996). [9] C.E. Lawrence, S.F. Atschu, M.S. Bogusk, J.S. Lu, A.F. Neuwand, and J.C. Wootton, Detectng subte sequence sgnas: a Gbbs sampng strategy for mutpe agnment, Scence, 226, 28 24, (993). [] J.S. Lu, A.F. Neuwad, and C.E. Lawrence, Bayesan modes for mutpe oca sequence agnment and Gbbs sampng strateges, J. Amer. Statstca Assoc, 9, 6 69, (99). [] X. Lu, D.L. Brutag, and J.S. Lu, BoProspector: dscoverng conserved DNA motfs n upstream reguatory regons of co-expressed genes, n Pac. Symp. Bocomput, pp , (2). [2] G.M. McLachan and D. Pee, Fnte Mxture Modes, New York: John Wey & Sons, Inc., 2. [3] E.P. Xng, W. Wu, M.I. Jordan, and R.M. Karp, LOGOS: A moduar Bayesan mode for de novo motf detecton, Journa of Bonformatcs and Computatona Boogy, 2(), 27 4, (24). 4 Concusons We have presented a margna mxture mode for dscoverng probabstc motfs n categorca sequences that ncorporates an advanced and more nformatve expressve motf mode. Two smar versons of that mode were proposed, one wth an expanson and another one wth a demtaton of bounds wthn poston weght matrx. Ths aows to ft the neghborhood of a motf that eads to an effcent and consstent nference of motf ocatons. Another sgnfcant advantage of our method s that the EM agorthm that s used to estmate the mode parameters, s ess depended on the mode parameters ntazaton than wth the cassca approach. Experments on a varety of artfca and rea data sets have shown mproved performance of the proposed scheme and ts abty to dentfy quatatvey better motfs, n comparson wth the basc ML approach and the method. We are pannng to further nvestgate the performance of the proposed method to other expermenta data sets and aso to desgn more compex expressve motf modes that can smutaneousy hande gaps among motf postons. REFERENCES [] T.L. Baey and C. Ekan, Unsupervsed earnng of mutpe motfs n Bopoymers usng Expectaton Maxmzaton, Machne Learnng, 2, 6

MARKOV CHAIN AND HIDDEN MARKOV MODEL

MARKOV CHAIN AND HIDDEN MARKOV MODEL JIAN ZHANG JIANZHAN@STAT.PURDUE.EDU Markov chan and hdden Markov mode are probaby the smpest modes whch can be used to mode sequenta data,.e. data sampes whch are not