Counting Patterns in Degenerated Sequences


Grégory Nuel
MAP5, CNRS 8145, University Paris Descartes, 45 rue des Saints-Pères, Paris, France
gregory.nuel@parisdescartes.fr

Abstract. Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerate alphabet where some symbols may correspond to several possible letters (ex: IUPAC DNA alphabet). When counting patterns in such degenerate sequences, the question that naturally arises is: how to deal with degenerate positions? Since most (usually 99%) of the positions are not degenerate, it is considered harmless to discard the degenerate positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerate, we show that not taking into account these positions might lead to erroneous observations, further proving the interest of our approach.

Keywords: Forward-Backward algorithm, Expectation-Maximization algorithm, Markov chain embedding, Deterministic Finite state Automaton.

1 Introduction

Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerate alphabet where some symbols may correspond to several possible letters. For example, the IUPAC [1] protein alphabet includes the following degenerate symbols: X for any amino-acid, Z for glutamic acid or glutamine, and B for aspartic acid or asparagine. For DNA sequences, there are even more such degenerate symbols, whose exhaustive list and meaning are given in Table 1 along with observed frequencies in several datasets from the EMBL [2] database.

V. Kadirkamanathan et al. (Eds.): PRIB 2009, LNBI 5780, pp. 222-232, 2009. (c) Springer-Verlag Berlin Heidelberg 2009

Table 1. Meaning and frequency of the IUPAC [1] DNA symbols in several files (est_pro_0, htg_pro_0, htc_fun_0, std_hum_2) of the release 97 of the EMBL nucleotide sequence database [2]. Degenerate symbols (lowest part of the table) contribute to 0.5% to 1% of the data.

symbol  meaning
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
U       Uracil
R       Purine: A or G
Y       Pyrimidine: C, T, or U
M       C or A
K       T, U, or G
W       T, U, or A
S       C or G
B       not A
D       not C
H       not G
V       not T, not U
N       any base

When counting patterns in such degenerate sequences, the question that naturally arises is: how to deal with degenerate positions? Since most (usually 99%) of the positions are not degenerate, it is usually considered harmless to discard the degenerate positions in order to get an observation. Another option might be to preprocess the dataset by replacing each special letter by the most likely compatible symbol at the position (in reference with some background model). Finally, one might come up with some ad hoc counting rule like: whenever the pattern might occur, I add one to the observed count (1). However practical, all these solutions remain quite unsatisfactory from the statistician's point of view, and their possible consequences (like adding or missing occurrences) remain unclear.

In this paper, we want to deal rigorously with the problem of degenerate symbols in sequences by introducing the distribution of sequences under the uncertainty of their sequencing, and then by using this distribution to study the observed number of occurrences of a pattern of interest. To do so we place ourselves in a Markovian framework by assuming that the sequence X_1^l = X_1 ... X_l is an order d >= 1 (2) homogeneous Markov chain over the finite alphabet A. We denote by ν its starting distribution and by π its transition matrix. For all a ∈ A^d and for all b ∈ A we then have: P(X_1^d = a) = ν(a) and P(X_{i+d} = b | X_i^{i+d-1} = a) = π(a, b) with 1 <= i <= l - d.

For all 1 <= i <= l we denote by X_i ⊂ A the subset of all possible values taken by X_i according to the data. For example, if we consider the IUPAC DNA sequence ANTWY..., we have X_1 = {A}, X_2 = {A, C, G, T}, X_3 = {T}, X_4 = {A, T}, X_5 = {C, T}, ...

(1) One might also think to add a fraction of one which corresponds to the probability to see the corresponding letter at the degenerate position.
(2) For the sake of simplicity, the particular degenerate case where d = 0 is left to the reader.
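To make the degenerate alphabet concrete, here is a minimal sketch (not code from the paper) of the mapping from IUPAC DNA symbols to the per-position subsets X_i used throughout; the dictionary follows Table 1, restricted to the DNA alphabet {A, C, G, T} (so Y and the other U-containing symbols are mapped to their DNA letters only).

```python
# Minimal sketch (not from the paper): IUPAC DNA symbols mapped to the subsets
# of possible letters, and the per-position subsets X_1, ..., X_l of a sequence.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "M": {"A", "C"}, "K": {"G", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}

def position_sets(sequence):
    """Return the subsets X_1, ..., X_l for an IUPAC DNA sequence."""
    return [IUPAC[symbol] for symbol in sequence]

# The example of the text: "ANTWY" gives {A}, {A,C,G,T}, {T}, {A,T}, {C,T}.
print(position_sets("ANTWY"))
```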

In a first part we establish the distribution of X_1^l under the constraint that X_1^l ∈ X_1^l using an adaptation of the Baum-Welch algorithm [3]. We then demonstrate that the constrained sequence is distributed according to a heterogeneous Markov model whose starting distribution and transition function have explicit expressions. This result hence allows to obtain the exact constrained distribution of a pattern by the application of known Markov chain embedding techniques. The interest of the method is finally illustrated with EST data and discussed.

2 Constrained Distribution

In order to compute the constrained probability P(X_1^l ∈ X_1^l) we follow the sketch of the Baum-Welch algorithm [3] by introducing the Forward and Backward quantities.

Proposition 1 (Forward). For all x_1^l ∈ A^l and d <= i <= l we define the forward quantity

  F_i(x_{i-d+1}^i) := P(X_{i-d+1}^i = x_{i-d+1}^i, X_1^i ∈ X_1^i)

which is computable by recurrence through

  F_i(x_{i-d+1}^i) = I_{X_i}(x_i) Σ_{x_{i-d} ∈ X_{i-d}} F_{i-1}(x_{i-d}^{i-1}) π(x_{i-d}^{i-1}, x_i)    (1)

for d+1 <= i <= l, and with the initialization F_d(x_1^d) = ν(x_1^d) I_{X_1^d}(x_1^d), where I is the indicatrix function (3). We then obtain that:

  P(X_1^l ∈ X_1^l) = Σ_{x_{l-d+1}^l ∈ X_{l-d+1}^l} F_l(x_{l-d+1}^l).    (2)

Proof. We prove Equation (1) by simply rewriting F_i(x_{i-d+1}^i) as

  F_i(x_{i-d+1}^i) = Σ_{x_{i-d} ∈ X_{i-d}} P(X_{i-d}^{i-1} = x_{i-d}^{i-1}, X_1^{i-1} ∈ X_1^{i-1}) × P(X_i = x_i, X_i ∈ X_i | X_{i-d}^{i-1} = x_{i-d}^{i-1}, X_1^{i-1} ∈ X_1^{i-1}),

where the first factor is F_{i-1}(x_{i-d}^{i-1}) and, thanks to the Markov property, the second factor is π(x_{i-d}^{i-1}, x_i) I_{X_i}(x_i). The proof of Equation (2) is established in a similar manner.

(3) For any set E, subset A ⊂ E and element a ∈ E, I_A(a) = 1 if a ∈ A and I_A(a) = 0 otherwise.
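As an illustration, here is a minimal sketch (my own, not the paper's code) of the forward recursion of Proposition 1 in the simplest case d = 1; `nu`, `pi` and `sets` are assumed to be a starting-distribution dict, a dict-of-dicts transition matrix, and the list returned by `position_sets` from the earlier sketch.

```python
# Hedged sketch of Proposition 1 for an order d = 1 Markov chain.
# F[i][b] = P(X_{i+1} = b, X_1 in X_1, ..., X_{i+1} in X_{i+1})  (0-based lists).
def forward(nu, pi, sets):
    alphabet = list(nu)
    F = [{a: (nu[a] if a in sets[0] else 0.0) for a in alphabet}]
    for i in range(1, len(sets)):
        F.append({
            b: (sum(F[i - 1][a] * pi[a][b] for a in alphabet) if b in sets[i] else 0.0)
            for b in alphabet
        })
    return F   # sum(F[-1].values()) is P(X_1^l in X_1^l), i.e. Equation (2)

# Toy usage with a uniform background model and the sequence ANTWY of the text.
nu = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pi = {a: dict(nu) for a in nu}
print(sum(forward(nu, pi, position_sets("ANTWY"))[-1].values()))
```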

Proposition 2 (Backward). For all x_1^l ∈ A^l and d <= i <= l we define the backward quantity

  B_i(x_{i-d+1}^i) := P(X_{i+1}^l ∈ X_{i+1}^l | X_{i-d+1}^i = x_{i-d+1}^i)

which is computable by recurrence through

  B_i(x_{i-d+1}^i) = Σ_{x_{i+1} ∈ X_{i+1}} π(x_{i-d+1}^i, x_{i+1}) B_{i+1}(x_{i-d+2}^{i+1})    (3)

for d <= i <= l-1, and with the initialization B_l(x_{l-d+1}^l) = 1. We then obtain that:

  P(X_1^l ∈ X_1^l) = Σ_{x_1^d ∈ X_1^d} ν(x_1^d) B_d(x_1^d).    (4)

Proof. The proof is very similar to the one of Proposition 1 and is hence omitted.

Theorem 1 (Marginal distributions). For all x_1^l ∈ A^l we have the following results:
a) P(X_1^d = x_1^d, X_1^l ∈ X_1^l) = I_{X_1^d}(x_1^d) ν(x_1^d) B_d(x_1^d);
b) P(X_{i-d+1}^{i+1} = x_{i-d+1}^{i+1}, X_1^l ∈ X_1^l) = F_i(x_{i-d+1}^i) π(x_{i-d+1}^i, x_{i+1}) I_{X_{i+1}}(x_{i+1}) B_{i+1}(x_{i-d+2}^{i+1});
c) P(X_{l-d}^l = x_{l-d}^l, X_1^l ∈ X_1^l) = F_{l-1}(x_{l-d}^{l-1}) π(x_{l-d}^{l-1}, x_l) I_{X_l}(x_l);
d) P(X_{i-d+1}^i = x_{i-d+1}^i, X_1^l ∈ X_1^l) = F_i(x_{i-d+1}^i) B_i(x_{i-d+1}^i).

Proof. a), b) and c) are proved using the same conditioning mechanisms as in the proofs of Propositions 1 and 2. One could note that a) is a direct consequence of Equation (4), while c) could be derived from Equation (2). Thanks to Equation (3), it is also clear that b) implies d), which achieves the proof.

From now on we denote by P^C(A) := P(A | X_1^l ∈ X_1^l) the probability of an event A under the constraint that X_1^l ∈ X_1^l.

Theorem 2 (Heterogeneous Markov chain). Under P^C, X_1^l is an order d heterogeneous Markov chain whose starting distribution ν^C is given by:

  ν^C(x_1^d) := P^C(X_1^d = x_1^d) = I_{X_1^d}(x_1^d) ν(x_1^d) B_d(x_1^d) / P(X_1^l ∈ X_1^l)    (5)

and whose transition matrix π^C_{i+1} toward position i+1 is given by:

  π^C_{i+1}(x_{i-d+1}^i, x_{i+1}) := P^C(X_{i+1} = x_{i+1} | X_{i-d+1}^i = x_{i-d+1}^i) = π(x_{i-d+1}^i, x_{i+1}) I_{X_{i+1}}(x_{i+1}) B_{i+1}(x_{i-d+2}^{i+1}) / B_i(x_{i-d+1}^i).    (6)

Proof. Equation (5) is a direct consequence of Theorem 1 a) and Equation (4). For Equation (6) we start by writing P^C(X_{i+1} = x_{i+1} | X_{i-d+1}^i = x_{i-d+1}^i) = P(A | B, C, D) with A = {X_{i+1} = x_{i+1}}, B = {X_{i-d+1}^i = x_{i-d+1}^i}, C = {X_1^i ∈ X_1^i}, and D = {X_{i+1}^l ∈ X_{i+1}^l}. Thanks to Bayes' formula we get that P(A | B, C, D) ∝ P(D | A, B, C) P(A | B, C). We finally use the Markov property to get P(D | A, B, C) = I_{X_{i+1}}(x_{i+1}) B_{i+1}(x_{i-d+2}^{i+1}) and P(A | B, C) = π(x_{i-d+1}^i, x_{i+1}), which achieves the proof.
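A minimal sketch (again my own, for d = 1, not the paper's code) of the backward recursion of Proposition 2 and of the constrained transition of Theorem 2; as before, `pi` is a dict-of-dicts transition matrix and `sets` the list of subsets X_i.

```python
# Hedged sketch of Proposition 2 and Theorem 2 for d = 1.
def backward(pi, sets):
    alphabet = list(pi)
    l = len(sets)
    B = [None] * l
    B[l - 1] = {a: 1.0 for a in alphabet}                 # initialization B_l = 1
    for i in range(l - 2, -1, -1):
        B[i] = {a: sum(pi[a][b] * B[i + 1][b] for b in alphabet if b in sets[i + 1])
                for a in alphabet}
    return B

def constrained_transition(pi, B, sets, i):
    """pi^C for the step from 0-based position i to i+1:
    proportional to pi(a, b) * 1(b in X_{i+1}) * B_{i+1}(b)."""
    alphabet = list(pi)
    out = {}
    for a in alphabet:
        weights = {b: (pi[a][b] * B[i + 1][b] if b in sets[i + 1] else 0.0)
                   for b in alphabet}
        total = sum(weights.values())                     # equals B_i(a), Equation (3)
        out[a] = {b: (w / total if total > 0.0 else 0.0) for b, w in weights.items()}
    return out
```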

One should note that the reverse sequence X_l ... X_1 is also a heterogeneous order d Markov model whose parameters can be expressed through the Forward quantities.

3 Estimating the Background Model

Let us denote by θ := (ν, π) the parameters of our order d Markov model. We denote by P_θ all probability computations performed using the parameter θ. Since the log-likelihood L_θ(X_1^l ∈ X_1^l) := log P_θ(X_1^l ∈ X_1^l) may be derived either from the Forward or the Backward quantities, it is possible to maximize this likelihood numerically to get the Maximum Likelihood Estimator (MLE) θ̂ := argmax_θ L_θ(X_1^l ∈ X_1^l).

We suggest here an alternative approach founded on the classical Expectation-Maximization algorithm for maximum likelihood estimation from incomplete data [4]. To do so, we simply consider that X_1^l ∈ X_1^l is the observed data, while X_1^l = x_1^l is the unobserved data. We then get the following result:

Proposition 3 (EM algorithm). For any starting parameter θ_0 := (ν_0, π_0), we consider the sequence (θ_j)_{j>=0} defined for all j >= 0 by θ_{j+1} := (ν_{j+1}, π_{j+1}) with:

  ν_{j+1}(a) = I_{X_1^d}(a) ν_j(a) B_d^{θ_j}(a) / P_{θ_j}(X_1^l ∈ X_1^l)    (7)

  π_{j+1}(a, b) = [ Σ_{i=d}^{l-1} I_{X_{i+1}}(b) F_i^{θ_j}(a) π_j(a, b) B_{i+1}^{θ_j}(a_2 ... a_d b) ] / [ Σ_{i=d}^{l-1} F_i^{θ_j}(a) B_i^{θ_j}(a) ]    (8)

where a = a_1 ... a_d ∈ A^d and b ∈ A, where the F^{θ_j} and B^{θ_j} denote respectively the Forward and Backward quantities computed with the current value θ_j of the parameter, and with the convention that B_l^{θ_j} = 1. The sequence (θ_j)_{j>=0} converges towards a local maximum of L_θ(X_1^l ∈ X_1^l).

Proof. This comes from a special application of the EM algorithm [4] where the Expectation step (Step E) consists in computing

  Q(θ | θ_j) := Σ_{x_1^l ∈ X_1^l} P_{θ_j}(X_1^l = x_1^l | X_1^l ∈ X_1^l) log P_θ(X_1^l = x_1^l),

while the Maximization step (Step M) consists in computing θ_{j+1} = argmax_θ Q(θ | θ_j). Equations (7) and (8) then simply come from a natural adaptation of the classical MLE of an order d Markov chain using the pseudo-counts that come directly from Theorem 1.
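For d = 1 the updates of Proposition 3 reduce to the Baum-Welch-style re-estimation sketched below; this is my own simplification (not the paper's implementation) and it reuses the `forward` and `backward` helpers from the previous sketches.

```python
# One EM iteration (d = 1): expected counts from Theorem 1, then row normalization.
def em_step(nu, pi, sets):
    alphabet = list(nu)
    F, B = forward(nu, pi, sets), backward(pi, sets)
    # Equation (7): new starting distribution proportional to F_1(a) * B_1(a).
    w = {a: F[0][a] * B[0][a] for a in alphabet}
    total = sum(w.values())                         # equals P(X_1^l in X_1^l)
    new_nu = {a: w[a] / total for a in alphabet}
    # Equation (8): expected transition pseudo-counts, then row normalization.
    counts = {a: {b: 0.0 for b in alphabet} for a in alphabet}
    for i in range(len(sets) - 1):
        for a in alphabet:
            for b in alphabet:
                if b in sets[i + 1]:
                    counts[a][b] += F[i][a] * pi[a][b] * B[i + 1][b]
    new_pi = {}
    for a in alphabet:
        row = sum(counts[a].values())
        # keep the old row when the letter a is impossible everywhere
        new_pi[a] = {b: (counts[a][b] / row if row > 0.0 else pi[a][b]) for b in alphabet}
    return new_nu, new_pi
```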

4 Counting Patterns

Let us consider here W a finite set of words over A. We want to count the number N of positions where W occurs in our degenerate sequence. Unfortunately, since the sequence itself is not observed, we study instead the number N of matching positions in the random sequence X_1^l under P^C. Thanks to Theorem 2 we hence need to establish the distribution of N over a heterogeneous order d Markov chain. To do so, we perform an optimal Markov chain embedding of the problem through a Deterministic Finite Automaton (DFA), as suggested in [5; 6; 7; 8]. We use here the notations of [8]. Let (A, Q, s, F, δ) be a minimal DFA recognizing the language (4) A*W of all texts over A ending with an occurrence of W (see Figure 1 for an example of such a minimal DFA). Q is a finite state space, s ∈ Q is the starting state, F ⊂ Q is the subset of final states, and δ : Q × A → Q is the transition function. We recursively extend the definition of δ over Q × A* thanks to the relation δ(p, aw) := δ(δ(p, a), w) for all p ∈ Q, a ∈ A, w ∈ A*. We additionally suppose that this automaton is non d-ambiguous (5), which means that for all q ∈ Q, δ^{-1}(q) := {a ∈ A^d, ∃p ∈ Q, δ(p, a) = q} is either a singleton or the empty set.

[Figure 1: transition diagram of the 8-state DFA.]
Fig. 1. Minimal DFA recognizing the language of all DNA sequences ending with an occurrence of the IUPAC pattern NTG. This DFA has a total of 8 states, s = 0 being the starting state and F = {7} being the subset of final states. This DFA is d-ambiguous since one can reach state 0 or state 3 with more than one letter.

(4) A* denotes the set of all (possibly empty) texts over A.
(5) A DFA having this property is also called a d-th order DFA in [7].
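For concreteness, here is a hedged sketch (not the paper's code) that builds, by subset construction over prefix states, an automaton recognizing A*W for a set of IUPAC words W. It reuses the IUPAC dictionary from the first sketch and performs no minimization, so it may have more states than the minimal DFA of Figure 1.

```python
# Hedged sketch: subset construction of a DFA for the language A* W, where W is
# a set of IUPAC words.  States are frozensets of (word index, matched prefix length).
def build_dfa(words, alphabet=("A", "C", "G", "T")):
    def step(state, letter):
        nxt = {(k, 0) for k in range(len(words))}          # empty prefixes always active
        for (k, j) in state:
            if j < len(words[k]) and letter in IUPAC[words[k][j]]:
                nxt.add((k, j + 1))
        return frozenset(nxt)

    start = frozenset((k, 0) for k in range(len(words)))
    states, delta, stack = {start}, {}, [start]
    while stack:
        p = stack.pop()
        for a in alphabet:
            q = step(p, a)
            delta[(p, a)] = q
            if q not in states:
                states.add(q)
                stack.append(q)
    final = {p for p in states if any(j == len(words[k]) for (k, j) in p)}
    return start, states, delta, final

# Example: an (unminimized) automaton for the IUPAC pattern of Figure 1.
start, states, delta, final = build_dfa(["NTG"])
```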

Theorem 3 (Markov chain embedding). We consider the random sequence over Q defined by X̃_0 := s and X̃_i := δ(X̃_{i-1}, X_i) for 1 <= i <= l. Under P^C, (X̃_i) is a heterogeneous order 1 Markov chain over Q̃ := δ(s, A^d A*) such that, for all p, q ∈ Q̃ and d <= i <= l-1, the starting distribution μ(p) := P^C(X̃_d = p) and the transition matrix T_{i+1}(p, q) := P^C(X̃_{i+1} = q | X̃_i = p) are given by:

  μ(p) = ν^C(a) if ∃a ∈ A^d such that δ(s, a) = p; 0 else,
  T_{i+1}(p, q) = π^C_{i+1}(δ^{-1}(p), b) if ∃b ∈ A such that δ(p, b) = q; 0 else,

where δ^{-1}(p) is identified with its unique element thanks to the non d-ambiguity assumption.

Since the T_{i+1} contain all the counting transitions, we keep track of the number of occurrences by associating a dummy variable y with these transitions. Computing the marginal distribution at the end of the sequence then gives access to the moment generating function (mgf) of the random number of occurrences (see [5; 6; 7; 8] for more details):

Corollary 1 (Moment generating function). The moment generating function F(y) of the random number N under P^C is given by:

  F(y) := Σ_{k>=0} P^C(N = k) y^k = μ [ Π_{i=d}^{l-1} (P_{i+1} + y Q_{i+1}) ] 1    (9)

where 1 is a column vector of ones and where, for all i, T_{i+1} = P_{i+1} + Q_{i+1} with P_{i+1}(p, q) := I(q ∉ F) T_{i+1}(p, q) and Q_{i+1}(p, q) := I(q ∈ F) T_{i+1}(p, q) for all p, q ∈ Q̃.

5 Algorithm

The practical implementation of these results requires two main steps: 1) compute the Forward and Backward quantities; 2) compute the mgf using Corollary 1. For the first step, the resulting complexity is O(l) both in space and time. For the second step, the space complexity is O(D × |Q̃|), where D is the difference between the maximum and the minimum degree of F(y), and the time complexity is O(l × D × |Q̃| × |A|), taking advantage of the sparse structure of the T_{i+1}. Using this approach on a large dataset (ex: l = 5 × 10^6 or l = 3 × 10^9) may then result in high memory requirements and/or long running times. Fortunately, it is possible to reduce these complexities dramatically when considering degenerate sequences where most positions are deterministic, as it is the case with biological sequences.
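A hedged sketch (my own, for d = 1) of the computation behind Corollary 1: instead of relying on the non d-ambiguity assumption, it simply tracks the pair (DFA state, last letter) together with a running occurrence count, which yields the exact distribution of N at the cost of a slightly larger state space. The arguments `delta` and `final` can come from the `build_dfa` sketch above; `nu_c` is the constrained start distribution of Equation (5) and `pi_c` a list of per-position transition dicts, e.g. `[constrained_transition(pi, B, sets, i) for i in range(l - 1)]`.

```python
from collections import defaultdict

# Hedged sketch (d = 1): exact distribution of the number N of matching positions
# under the constrained heterogeneous chain.
def count_distribution(nu_c, pi_c, start, delta, final):
    dp = defaultdict(float)                      # keys: (dfa_state, last_letter, count)
    for a, p in nu_c.items():
        if p > 0.0:
            q = delta[(start, a)]
            dp[(q, a, int(q in final))] += p
    for step in pi_c:                            # one transition per remaining position
        nxt = defaultdict(float)
        for (q, a, n), p in dp.items():
            for b, t in step[a].items():
                if t > 0.0:
                    q2 = delta[(q, b)]
                    nxt[(q2, b, n + int(q2 in final))] += p * t
        dp = nxt
    dist = defaultdict(float)
    for (_, _, n), p in dp.items():
        dist[n] += p
    return dict(dist)
```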

Let us denote by I := {1 <= i <= l, |X_i| > 1} the set of degenerate positions in the sequence. It is clear then that the random state X̃_j is completely deterministic for all j ∈ J := {1 <= j <= l, ∀i ∈ I, j ∉ [i, i + d]}. The positions j ∈ J thus contribute in a deterministic way to N, with a fixed number of occurrences n. It hence remains only to take into account the variable part N - n = N_1 + ... + N_k, where the N_i are independent contributions of each of the k segments of J̄, the complementary of J in {1, ..., l}. If we denote by F_i(y) the mgf of N_i, we get that

  F(y) = y^n Π_{i=1}^k F_i(y),

which dramatically reduces the complexity of the problem, since each F_i(y) may be obtained by a simple application of Corollary 1 on the particular short segment of interest, and one only needs to compute the Forward-Backward quantities for this particular segment.

For example, let us consider that the observed IUPAC sequence is x_1^l = AAYGCANGBAGGCACWAGR and that d = 2. We have I = {3, 7, 9, 20, 24} and J̄ = [3, 5] ∪ [7, 11] ∪ [20, 22] ∪ [24, 25]. In order to compute F_1(y), F_2(y), F_3(y) and F_4(y), one just needs to know the order d = 2 past before each of the corresponding segments: AA for the first one, CA for the second one, and likewise the two letters preceding the third and the last segments.
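A small hedged sketch (my own) of this decomposition: it lists the degenerate positions I and merges the windows [i, i + d] into the segments of J̄ that actually require the heterogeneous computation. It reuses the IUPAC dictionary from the first sketch, and positions are 1-based as in the text.

```python
# Hedged sketch of the decomposition of Section 5 (1-based positions).
def degenerate_segments(sequence, d):
    l = len(sequence)
    I = [i for i, s in enumerate(sequence, start=1) if len(IUPAC[s]) > 1]
    covered = set()
    for i in I:                                   # each degenerate position covers [i, i+d]
        covered.update(range(i, min(i + d, l) + 1))
    segments, run = [], []
    for j in range(1, l + 1):                     # merge covered positions into segments
        if j in covered:
            run.append(j)
        elif run:
            segments.append((run[0], run[-1]))
            run = []
    if run:
        segments.append((run[0], run[-1]))
    return I, segments

# With degenerate symbols at positions 3, 7, 9, 20 and 24, d = 2 and l = 25, this
# returns the segments (3,5), (7,11), (20,22) and (24,25), as in the text.
```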

6 Discussion

Let us consider the dataset est_pro_0 which is described in Table 1. We estimated on this dataset, by MLE through the EM algorithm, the transition matrix π of an order d = 1 homogeneous Markov model over A = {A, C, G, T}. Since only 1% of the dataset is degenerate, we observe little difference between this rigorous estimate and one obtained through a rough heuristic like discarding all degenerate positions in the data. However, this result should not be taken as a rule, especially when considering more degenerate sequences (e.g. with 10% degenerate positions) and/or higher order Markov models (e.g. d = 4).

Using this model, it is possible to study the observed distribution of a pattern in the dataset by computing, through Corollary 1, the distribution of its random number of occurrences N under the constrained probability P^C. Table 2 compares the number of occurrences obtained by discarding all degenerate positions in the data (Count1) to the observed distribution. Despite the fact that only 1% of the data are degenerate, we can see that there are great differences between this naive approach and the real observed distribution. For example, if we consider the simple pattern GCA, we can see that the naive count of 715 occurrences lies well outside the 90% credibility interval [727, 740], and we have similar results for the other considered patterns.

Table 2. Distribution of patterns in the degenerate IUPAC sequences from est_pro_0. Count1 is obtained by discarding all degenerate positions in the dataset, and Count2 by replacing each special letter by the most likely compatible symbol. Since the observed distribution is discrete, percentiles and mean are rounded to the closest value.

pattern     Count1  Count2  min  5%-tile  mean  95%-tile  max
GCA
AG
NTG
RNANNNSM

For more complex patterns like NTG, the difference between the naive count and the observed distribution is even more dramatic, since 839 does not even belong to the support [853, 1005] of the observed distribution. This is due to the fact that the string NTG actually occurs 14 times in the dataset. Since our naive approach discards all positions in the data where a symbol other than A, C, G or T appears, these 14 occurrences are hence omitted. If we now preprocess the dataset by replacing all degenerate symbols by the most frequent letter in the corresponding subset, we get the number of occurrences denoted Count2. If this heuristic seems to give an interesting result for pattern GCA (counting close to the mean), it is unfortunately not the case for the other ones, for which the method results either in under-counting (pattern NTG) or over-counting (patterns AG and RNANNNSM). As a general rule, it is usually difficult to predict the bias introduced by a particular heuristic, since it can lead either to under- or over-counting (for example Count1 always results in under-counting), and this may even depend on the pattern of interest (like with Count2). The rigorous method we have developed here may hence also provide a way to test the statistical properties of a particular heuristic. Finally, let us point out that, thanks to the optimal Markov chain embedding provided by the DFA-based approach presented above, we are here able to deal with relatively complex patterns like RNANNNSM.
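To reproduce the kind of summary reported in Table 2, here is a small hedged helper (my own) that extracts the minimum, maximum, mean and percentiles from the exact distribution returned by the `count_distribution` sketch above.

```python
# Hedged helper: summary statistics of a discrete count distribution {count: prob}.
def summarize(dist, levels=(0.05, 0.95)):
    items = sorted(dist.items())
    mean = sum(n * p for n, p in items)
    cum, quantiles = 0.0, {}
    for n, p in items:
        cum += p
        for lev in levels:
            if lev not in quantiles and cum >= lev:
                quantiles[lev] = n
    summary = {"min": items[0][0], "max": items[-1][0], "mean": mean}
    summary.update({f"{int(100 * lev)}%-tile": q for lev, q in quantiles.items()})
    return summary
```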

7 Conclusion

In this paper, we provide a rigorous way to deal with the distribution of Markov chains over a finite alphabet A under the constraint that each position X_i of the sequence belongs to a restricted subset X_i ⊂ A. We provide a Forward-Backward framework to compute marginal distributions and derive from it an EM estimation procedure. We also prove that the resulting constrained distribution is a heterogeneous Markov chain and provide explicit formulas to recursively compute its transition matrix. Thanks to this result, it is possible to apply known DFA-based methods from pattern theory to study the distribution of a pattern of interest in this constrained sequence, hence providing a trustful observed distribution for the pattern number of occurrences. This information may then be used to derive a p-value P for a pattern by combining p_n, the p-value of the observation of n occurrences in an unconstrained dataset, with the observed distribution, through formulas like P = Σ_n p_n P^C(N = n).

One should note that the approach we introduce here may have more applications than just counting patterns in IUPAC sequences. For example, one might use a similar approach to take into account the occurrence positions of known patterns of interest, thus allowing to derive the distribution of patterns conditionally on a (possibly complex) set of other patterns. One should also point out that the constraint X_i ∈ X_i could easily be made more complex, for example by considering a specific distribution over X_i. For instance, such a distribution may come from the posterior decoding probabilities of a sequencing machine.

From the computational point of view, it is essential to understand that the heterogeneous nature of the Markov chain we consider forbids the use of classical computational tricks like power computations. The resulting complexity is hence linear in the sequence length l rather than logarithmic. However, one should expect a dramatic improvement of the method by restricting the use of heterogeneous Markov models to the vicinity of degenerate positions only, as suggested in Section 5. With such an approach, one might rely on classical pattern matching for 99% of the data, and the method presented above would be restricted to the study of the 1% remaining data. Using this computational trick, it hence seems possible to rely on the rigorous exact computation introduced here rather than on a biased heuristic.

Finally, we have demonstrated with our example that even a small amount of degenerate data may have huge consequences in terms of pattern frequencies, and thus possibly affect every subsequent analysis method involving these frequencies (like Markov and hidden Markov model estimations, and pattern studies). Considering the possible bias caused by degenerate letters in biological data, and the reasonable complexity of the exact solution we introduce in this paper, our study suggests that the problem of degenerate data in pattern related analysis should no longer be ignored.

References

[1] IUPAC: International Union of Pure and Applied Chemistry (2009)
[2] EMBL: European Molecular Biology Laboratory Nucleotide Sequence Database (2009)
[3] Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1), 164-171 (1970)
[4] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Stat. Society, Series B 39(1), 1-38 (1977)
[5] Nicodème, P., Salvy, B., Flajolet, P.: Motif statistics. Theoretical Comp. Sci. 287(2), 593-617 (2002)

[6] Crochemore, M., Stefanov, V.: Waiting time and complexity for matching patterns with automata. Info. Proc. Letters 87(3), 119-125 (2003)
[7] Lladser, M.E.: Minimal Markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop (2007)
[8] Nuel, G.: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J. of Applied Prob. 45(1), 226-243 (2008)
