IMA Preprint Series # 2103


STATISTICAL CHARACTERIZATION OF PROTEIN ENSEMBLES

By Diego Rother, Guillermo Sapiro, and Vijay Pande

IMA Preprint Series # 2103 (March 2006)

INSTITUTE FOR MATHEMATICS AND ITS APPLICATIONS, UNIVERSITY OF MINNESOTA, 400 Lind Hall, 207 Church Street S.E., Minneapolis, Minnesota

Statistical Characterization of Protein Ensembles

Diego Rother (1, 2), Guillermo Sapiro (1), and Vijay Pande (3)

Abstract

When accounting for structural fluctuations or measurement errors, a single rigid structure may not be sufficient to represent a protein. One approach to solve this problem is to represent the possible conformations as a discrete set of observed conformations, an ensemble. In this work, we follow a different, richer approach: we introduce a framework for estimating probability density functions in very high dimensions, and then apply it to represent ensembles of folded proteins. The proposed approach combines techniques such as kernel density estimation, maximum likelihood, cross-validation, and bootstrapping. We present the underlying theoretical and computational framework, apply it to artificial data and to protein ensembles obtained from molecular dynamics simulations, and compare the results with those obtained experimentally, illustrating the potential and advantages of this representation.

1 Introduction and motivation

A single structure is often used to represent a protein. This can be considered natural when the structure is assumed to be rigid, but protein structure is known to fluctuate under physiological conditions. This fact, overlooked in many situations in favor of the simplicity of a single structure, may be advantageously accounted for at times when high resolution techniques for structure determination are available. Even if the true structure were fixed and unique, the uncertainty in its determination by the (imperfect) measurement of some property (e.g., diffraction or magnetic resonance) also produces variability, since those methods generally optimize a model to fit the observations, a process prone to finding multiple local minima. A similar situation arises when simulations are used for structure determination: the modeled energy landscape is populated with multiple local minima. As a result of this and other intrinsic simulation characteristics (e.g., randomness), multiple structure representatives for the same protein are possible. In applications where the fluctuations cannot be ignored and, moreover, are to be favorably exploited, how should they be represented and incorporated into the calculations?
One way is to represent the protein structure not as a single conformation but as a finite set of conformations, corresponding to different observations of its state. In this work we propose a different, richer approach, consisting of estimating a probability density function (pdf) from the available observations of the state, and using this pdf to represent the ensemble (a finite set of conformations is just a particular case of this, with the pdf being delta functions placed at the observation points). This rich representation is starting to gain interest in the protein research community, e.g., [1, 2] and Lindorff-Larsen (personal communication). For example, this type of representation has been recently pursued to rank the space of conformations in agreement with NMR observations [].

(1) Department of Electrical and Computer Engineering, University of Minnesota, 200 Union St. SE, Minneapolis, MN 55455, USA. rot,guille@ece.umn.edu
(2) To whom correspondence should be addressed.
(3) Chemistry Department, Stanford University, Stanford, CA, USA. pande@stanford.edu

This representation may allow one to see aspects previously hidden when the ensemble was regarded as a set of discrete conformations (e.g., the modes), and also provides a natural framework to perform certain operations, e.g., to compare ensembles of the same protein that were obtained by different methods, or to determine the probability that a particular conformation belongs to the ensemble [3]. New conformations that combine properties of the ensemble can be obtained from the pdf as well.

A possible further application of the framework comes from the observation that multiple local minima lie close to the global minimum, as postulated to explain the robustness of the native state, and suggested as a way to predict it for certain protein classes [4]. This observation, if general, can be translated into the requirement that the native structure (or ensemble) must reside where the density of local minima is high. Further motivation for the necessity to understand the conformational space and its probability distribution is suggested by recent work toward high-resolution de novo structure prediction [5]; see in particular the authors' remark that conformational sampling remains the primary stumbling block toward this challenging goal. It is imperative then to have a good description of the ensemble, and ideally, such a description is given by a probability density function.

Important by-products of the approach introduced here include an idea of the completeness of the sample to represent the space of conformations that the protein adopts; an estimate of the conformational entropy (and its error), which may have important thermodynamic consequences (Lindorff-Larsen, personal communication); and a measurement of the dependency between variables.

The current efforts in protein research thus clearly support the need for a good understanding of the protein conformational space and, in particular, of its probability density function (pdf). The primary goal of this paper is to present a theoretical and computational framework to compute such a probability density function. Protein ensembles consist of conformations usually having hundreds or even thousands of degrees of freedom.
How can inferences be made from sample sizes that are roughly of the same order? This challenging question is addressed in this article. We derive, under clear optimality criteria from information theory, the best possible pdf from the available data. To achieve this, we decompose the global density as a product of lower dimensional factors, conditional probabilities themselves, chosen by a genetic algorithm to maximize the global likelihood. The approach exploits the fact that each degree of freedom (or coordinate) does not strongly depend on every other coordinate, but only on a few, which are automatically found by our approach. Then, we proceed to estimate each factor using classical density estimation techniques, that is, Kernel Density Estimation and Maximum Likelihood. In addition to computing the probability distribution of the ensemble, we explicitly and automatically obtain the critical dependences between the variables, such as torsion angles.

The main data used in this work comes from simulations of protein ensembles obtained by means of molecular dynamics [6]. The framework described here can be used to characterize other protein ensembles, computed either via molecular dynamics or using other structural determination methods (e.g., rotameric libraries with applications in high-resolution protein folding [7], or even direct multiple physical measurements). The framework can also be used to include protein flexibility in protein docking [8]. More on this will be presented in the discussion section.

The remainder of this paper is organized as follows. In Section 2 we give a description of the mathematical and computational method proposed to compute the desired probability density function, given a finite set of conformations. As a proof of concept and for pedagogic reasons, we first use the developed framework on an artificial dataset. This is presented in Section 3.1. In Section 3.2 we use the framework on real data. In addition to computing the pdf, we explicitly derive the critical inner dependences of the torsion angles, and produce novel conformations sampled from the computed pdf. Their relationship with experimental data is studied as well. Concluding remarks and discussions are provided in Section 4.

2 Methods

An ensemble (footnote 4) is a set of conformations of the same protein. Each conformation corresponds to a particular arrangement of the protein's constitutive atoms in three-dimensional (3D) space. This arrangement can be described (or partially described) by different sets of features depending on the application at hand. In this work, we consider the backbone of the protein, which can be completely described by the usual $2(M-1)$ torsion angles ($(M-1)$ φ's and $(M-1)$ ψ's), where $M$ is the number of residues in the protein [9]. Our goal is to develop a technique to estimate the density of the unknown process that generates the set of conformations, the ensemble. This density is to be estimated from its available samples (a finite set of conformations represented by vectors of length $2(M-1)$). To address this we consider that a coordinate of the sampled conformation is related to just a few other coordinates, without knowing in advance to which ones. In other words, we set out to infer the relationships between the coordinates (torsion angles in our example), and use this information to estimate the density of the process more efficiently. While this is a natural assumption based on the chemical nature of proteins, it is also fundamental to reduce the dimensionality of the problem, which is needed because only a finite and relatively small number of observations exist. The proposed computational framework involves a number of components from the fields of statistics, information theory, artificial neural networks, and computer science.
It is philosophically related to Hinton's Products of Experts (PoE) [10], in the sense that several low dimensional probability densities ("experts" in Hinton's terminology), each one able only to explain local features of the dataset, are composed (multiplied) to explain global features. These approaches are best at explaining global features from local details, but cannot in general handle the effects of global features on local details. Our approach differs from Hinton's in that the experts are independent by (automatic) construction, thus avoiding the need to renormalize the product. An additional difference between the approaches is that Hinton uses parametric models for the experts, while we estimate the shape of each expert directly from the data. Further restrictions on the experts in our approach also guarantee the experts' heterogeneity, and ensure that all the local features are considered in the construction of the global density.

Our approach is also related to Akaike's Information Criterion (AIC) [11, 12], in that the choice of the order of the selected model is based on the Kullback-Leibler distance to the true unknown probability density. Contrary to AIC, in our approach the number of parameters does not explicitly appear in the criterion, but only through the degree of smoothing applied. This has the advantage that the parameters are not all equally weighted: each one is weighted according to the particular role it plays in the model.

Footnote 4. To avoid the confusion derived from using the word "sample" to denote both a set of conformations and a single conformation from that set, we reserve this word for the first meaning (a set of conformations). We also use "ensemble" for the same concept. We use the words "observation" and "point" for the second meaning (a single conformation).

In this section we present the proposed density estimation framework, and the rationale behind the selection of the particular methods to fulfill each task. For that purpose, and for easy reference and completeness, a brief review of each relevant method is included. In Section 2.1, a procedure for estimating a density is introduced. In Section 2.2, we analyze the errors and limits of this estimation procedure. Finally, in Section 2.3, we extend the technique to the kind of dataset of interest (conformation ensembles). To avoid obscuring the main concepts, the non-critical implementation details are omitted from this article. They can be obtained, together with the code, from the authors by request.

2.1 Density estimation

2.1.1 Maximum likelihood principle

In the search for the best density estimate, the best way to start is to define what is "best", or at least, what is "better". The field of statistics provides a tool for that purpose: the maximum likelihood principle [13]. The rationale behind the maximum likelihood principle can be stated quite simply: from all the possible models (or densities) that could have generated the ensemble, select the most probable one.

More formally, let $S = \{x_1, x_2, \ldots, x_N\}$ be an ensemble of $N$ independent and identically distributed observations in $\mathbb{R}^{2(M-1)}$, generated by one of the models in $\mathcal{M}$, the set of all possible models (formally defined in Section 2.1.2). Without going into further details at this point, let us mention that both the data points $x_i$ and the models belong to a continuous space. This is the reason to use densities instead of probabilities throughout these explanations. Let $M_h$ be a model in the family, parameterized by $h$, a parameter to be detailed in Section 2.1.2. Then, using Bayes' law, the probability density that the model generated the ensemble can be expressed as:

$$p(M_h \mid S) = p(S \mid M_h)\,\frac{p(M_h)}{p(S)},$$

where:

$p(S \mid M_h)$ is the probability density that the model $M_h$ generates the ensemble $S$. We will refer to it as the ensemble (or sample) likelihood and we will denote it by $L_h(S)$.
Since the observations in $S$ are assumed to be independent, this term can be easily computed (now the model is assumed to be known),

$$L_h(S) = p(S \mid M_h) = \prod_{i=1}^{N} p(x_i \mid M_h) = \prod_{i=1}^{N} f_h(x_i),$$

where $f_h$ is the density corresponding to the model $M_h$;

$p(S)$ is the unconditioned probability density of the ensemble. It is of little interest here, since it is equal for all the models, and therefore, as is commonly done, we will ignore it; and

$p(M_h)$ is the a priori probability density of the model, knowledge that should bias the choice of one model over another. When such information is not available, or if it is not reliable, the common practice is to assume that all models are equally probable. For the datasets dealt with in this article, a priori information dictated by the physics of the process is available (see [14]). Nevertheless, we choose for simplicity not to include this information in the model at this stage, keeping in mind that the results can be improved by doing otherwise. Note that this a priori probability is defined over the space of models.

Consequently, the factor $p(M_h)/p(S)$ is the same for all models (when all models are equally possible), and the most probable model is simply the one that makes the data most probable, that is, the one that maximizes the sample likelihood. A simplification can be obtained by maximizing the logarithm of this quantity instead,

$$\log L_h(S) = \log p(S \mid M_h) = \sum_{i=1}^{N} \log f_h(x_i).$$

Since the logarithm function is monotonically increasing, it attains the maximum at the same model but will have a much simpler derivative.

It will be useful for later developments to note the close relationship between the log-likelihood and the empirical entropy, or finite sample average of entropy (footnote 5), defined as [15]:

$$H_S(f_h) = -\frac{1}{N}\sum_{i=1}^{N} \log f_h(x_i). \qquad (1)$$

Its relation to the log-likelihood is readily apparent:

$$H_S(f_h) = -\frac{1}{N}\log L_h(S).$$

Then, maximizing the log-likelihood is equivalent to minimizing the empirical entropy, and since we find the notation simpler and the concepts richer, we choose to work with the latter.

Footnote 5. Since the empirical entropy converges to the entropy $H(f) = -\int f(x)\log f(x)\,dx$ as the sample size grows [34], we may abuse language and use the terms as synonyms.

An identical result can be obtained through a completely different (at first sight) approach using the relative entropy, also known in the literature as the Kullback-Leibler divergence, cross entropy, or asymmetric divergence [15]. It is defined, for two densities $f(x)$ and $g(x)$, as

$$D(f \,\|\, g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx.$$

The relative entropy is a measure of the distance between two distributions (footnote 6). Consequently it seems natural to define the best density estimate, $\hat f(x)$, as the one that minimizes the distance to the true unknown density, $f(x)$. The new score to minimize is [13]:

$$D(f \,\|\, \hat f) = \int f(x)\log\frac{f(x)}{\hat f(x)}\,dx = \int f(x)\log f(x)\,dx - \int f(x)\log \hat f(x)\,dx = -H(f) - E_f\!\left[\log \hat f(x)\right] \approx -H(f) - \frac{1}{N}\sum_{i=1}^{N}\log \hat f(x_i) = -H(f) + H_S(\hat f).$$

Since the first term in the last expression is constant, the expression to minimize is identical to the one already found in Equation (1). The maximum likelihood model is the one that is closest to the true density, as measured by the relative entropy distance. The approximately-equal symbol in the last derivation indicates that, when a finite sample is used, the score obtained is only an approximation to the true score for the model. The implications of this fact are discussed in Section 2.2.

Footnote 6. Strictly speaking it is not a distance since it is not symmetric and does not satisfy the triangle inequality. Nonetheless, it is always non-negative and zero if and only if $f = g$, and for this reason it is often useful to think of it as a distance between distributions [15].

Different metrics could have been chosen to measure the discrepancy between the estimated and true densities, and each choice would have resulted in a different (optimal) estimate. Our choice, based on classical information theory and statistics, tries to capture the order of the true density (reflected by a function of the quotient of the two densities being integrated) rather than its absolute value (as would be the case for the $L^p$ norm, where a function of the difference between the densities is integrated). An additional advantage of this choice is that it leads to more tractable calculations.

As Viola points out [13], there are three main reasons why maximum likelihood may fail to find an accurate model:

a) There are no sufficiently accurate models in the considered set of possible models. For this reason it is important to impose only the weakest assumptions on the density, in our case smoothness (Section 2.1.2);
b) The search for the best model may fail to discover the model that globally minimizes the entropy (because it gets stuck in a local minimum), even though it belongs to the considered set of possible models, hence the importance of a good optimization algorithm; and
c) Unlikely observations drawn from a model are only improbable, not impossible. If an unlikely sample is drawn from a model, it could well be assigned to another model that makes it more likely. This risk becomes smaller as the sample size grows.

2.1.2 The hypothesis space

Having provided a way to compare models, and from this to select the best one, the next step is to define the set of possible models, or hypothesis space, from which the best model should be chosen. Since we do not want to lose generality at this point by restricting our attention to a particular kind of densities, we adopt the sample-based approach, where the sample itself defines the model. In particular, we are interested in the method of Kernel Density Estimation, also known as Parzen Window Density Estimation [16]. According to this method, in one dimension the density estimates are given by

$$\hat f_h(x, S) = \frac{1}{N h}\sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right), \qquad (2)$$

where $K(x)$ (footnote 7) is a probability density function known as the kernel function and $h$, apart from indexing the space, is the window width, also known as the bandwidth or the smoothing parameter. An alternative but otherwise equivalent expression can be given in terms of the convolution of the sample with the kernel,

$$\hat f_h(x, S) = \frac{1}{N h}\,K\!\left(\frac{x}{h}\right) * \sum_{i=1}^{N}\delta(x - x_i).$$

The role of the kernel is to spread the mass of the observations around their original positions. Usually, $K$ is a unimodal even function, falling off quickly to zero. Bell shaped functions in general, and Gaussians in particular, are frequently used kernels. In this work we use the von Mises kernel, which plays the role of the Gaussian density for angular data [17].

Footnote 7. To simplify the exposition we assume that the kernel is stretched by the bandwidth. For periodic kernels (in $[0, 2\pi)$ in our case) clearly this approach does not work, and the shape of the kernel must change as well when the bandwidth changes. See [17] for a detailed discussion.
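As an illustration of the one-dimensional kernel estimate described above, the following is a minimal sketch of a kernel density estimate for angular data with a von Mises kernel. It is not the authors' code: the function names, the toy sample, and the use of a concentration parameter `kappa` in place of the bandwidth (a larger `kappa` gives a narrower kernel, roughly `kappa ≈ 1/h²` for narrow kernels) are assumptions made for the example.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # term_{k+1} = term_k * (x/2)^2 / (k+1)^2
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total

def von_mises_kernel(theta, kappa):
    """Von Mises density centred at 0 with concentration kappa (period 2*pi)."""
    return math.exp(kappa * math.cos(theta)) / (2.0 * math.pi * bessel_i0(kappa))

def kde_angular(x, sample, kappa):
    """Kernel density estimate at angle x: average of kernels centred on the sample."""
    return sum(von_mises_kernel(x - xi, kappa) for xi in sample) / len(sample)

# Hypothetical toy sample of torsion-like angles, in radians.
sample = [0.3, 0.5, 0.6, 3.0, 3.2]
n_grid = 2000
grid = [2.0 * math.pi * k / n_grid for k in range(n_grid)]
# Riemann sum over one period: the estimate should integrate to one.
mass = sum(kde_angular(g, sample, kappa=8.0) for g in grid) * (2.0 * math.pi / n_grid)
```

Because the kernel is periodic, the estimate is automatically periodic as well, which is the property that makes it suitable for angles.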

What was just presented in one dimension is easily generalized to higher dimensions. In $d$ dimensions, the equation for the kernel approximation (analogous to Equation (2)) is

$$\hat f_h(\mathbf{x}, S) = \frac{1}{N h^{d}}\sum_{i=1}^{N} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right). \qquad (3)$$

2.1.3 Cross-validation

It was stated in Section 2.1.1 that one of the reasons that might lead the maximum likelihood criterion to perform poorly is the absence of adequate models in the hypothesis space. This could tempt the naïve user to include as many models as possible in that set. In particular, for the definition of the set of functions that we gave in the previous section, this means that no constraints are imposed on the bandwidth. This is not recommendable; let us see why.

When the same sample is used both to approximate the density function and to estimate the entropy (likelihood), the expression for the entropy (from Equations (1) and (3)) becomes:

$$H_S(\hat f_h) = -\frac{1}{N}\sum_{i=1}^{N}\log \hat f_h(\mathbf{x}_i, S) = -\frac{1}{N}\sum_{i=1}^{N}\log\!\left[\frac{1}{N h^{d}}\sum_{j=1}^{N} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_j}{h}\right)\right]. \qquad (4)$$

Remember that Equation (4) assigns a score to each model/density $\hat f_h$. The lower the score, the better the model. The problem of density estimation can then be stated as finding the model, or in other words finding the bandwidth, that minimizes Equation (4). Unfortunately, this expression has no minimum for $h \in (0, \infty)$: as $h \to 0$ it tends to minus infinity, and the corresponding selected model gets further and further away from the desired model. Since the mass of the kernel sits around the origin and falls off quickly as we move away from it,

$$\sum_{j=1}^{N} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_j}{h}\right) \xrightarrow[h \to 0]{} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_i}{h}\right) = K(0), \quad\text{and}\quad H_S(\hat f_h) \xrightarrow[h \to 0]{} -\log\frac{K(0)}{N h^{d}} \to -\infty.$$

As a result, the density estimate tends to the function that has one delta at each of the sample points. The solution was over-trained to fit the data, disregarding any previous knowledge about the true density we may have had. This stresses the importance of carefully choosing the hypothesis space. What is known about the functions should be included in the definition of the hypothesis space to avoid overfitting, but without overdoing it, which would risk excluding the correct density from the hypothesis space.

A possible escape from this situation is to set a lower limit for the bandwidth. But how to select this limit in a sensible way is far from trivial. Furthermore, this limit should depend on the sample size and the unknown density itself.
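The degenerate behaviour of the resubstitution score described above can be checked numerically. The sketch below is an illustration only: it assumes a Gaussian kernel on the real line (rather than the von Mises kernel used in the paper), a hypothetical six-point sample, and hypothetical function names. It scores the same sample with smaller and smaller bandwidths.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def resubstitution_entropy(sample, h):
    """Empirical entropy when the same points build the density and score it."""
    n = len(sample)
    total = 0.0
    for xi in sample:
        # Density at xi includes the kernel centred on xi itself (the j == i term).
        density = sum(gaussian_kernel((xi - xj) / h) for xj in sample) / (n * h)
        total += math.log(density)
    return -total / n

sample = [-1.3, -0.4, 0.1, 0.2, 0.9, 1.7]   # hypothetical observations
bandwidths = (1.0, 0.1, 0.01, 0.001)
scores = [resubstitution_entropy(sample, h) for h in bandwidths]
```

The score keeps decreasing as the bandwidth shrinks, so a search that minimizes it is driven toward a null bandwidth, which is the over-training effect discussed in the text.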
We use another approach instead, known as cross-validation or leave-one-out [13]. Alternative approaches for estimating the bandwidth can be found in [16, 18]. Since the problem originated from using the same sample twice (both to construct the density function and to estimate the entropy by evaluating the density at the sample points), the cross-validation technique splits the sample in two (or uses two different samples if possible), and uses one part to construct the density and the other to estimate the entropy. It may seem at first a waste of data to use some sample points to compute the entropy, instead of using them to estimate the density, which is ultimately what we want to do. Cross-validation cleverly solves this problem by spending only one of the points of the sample: one part of the set contains all the points but one, and is used to construct the density, while the other part has a single point that is used to evaluate the density (i.e., the density is computed at this point).

This process is repeated $N$ times, leaving out each point once, and obtaining a corresponding density estimate each time. The contributions of all the points to the entropy are then added together to get a final estimate of the entropy. The new expression for the entropy that has to be minimized is:

$$H_S(\hat f_h) = -\frac{1}{N}\sum_{i=1}^{N}\log\!\left[\frac{1}{(N-1)\,h^{d}}\sum_{j \neq i} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_j}{h}\right)\right]. \qquad (5)$$

Although the phrase "repeated $N$ times" may suggest the opposite, the comparison of Equations (4) and (5) shows that the method with cross-validation takes almost the same amount of time.

2.1.4 Putting things together: A simple example

Having described a criterion to compare models (cross-validation maximum likelihood, Sections 2.1.1 and 2.1.3), and a set of models from which to choose (a family of kernel approximated functions, Section 2.1.2), we reach a point where it is helpful to present an example showing how density estimation actually works.

For the sake of the example, suppose that a density $f(x)$, shown with a black line in Figure 1a, has to be estimated from a finite sample, shown in the same figure as vertical red lines. Note that $f(x)$ is discontinuous and thus does not belong to the family of functions that can be constructed by placing smooth kernels on finite sample points (it is not in our hypothesis space), but it can be approximated. The sample is used to define a family of density functions (via the kernel method explained in Section 2.1.2), where each member is characterized by a bandwidth value. Using Equation (5), a score (or entropy value $H$) is computed for each member of the family. Figure 1b shows the correspondence between those values ($h$, the bandwidth, and $H$, the entropy). As explained in Section 2.1.1, the function having the lowest entropy (marked with a red circle in Figure 1b) is found (using classic gradient descent optimization [19]), and selected as the one that best estimates the true density. This function is depicted in red in Figure 1c, along with two other members of the same family of functions: one (in blue) having a smaller bandwidth (blue circle in Figure 1b), which was over-trained/over-fitted to the data and shows excessive oscillations, and the other (in green) having a larger bandwidth (green circle in Figure 1b), which was over-smoothed and loses part of the structure of the true density.
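The bandwidth selection just described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: a Gaussian kernel on the real line replaces the von Mises kernel, a plain grid search replaces the gradient descent mentioned in the text, the sample is drawn from a standard normal, and all names are hypothetical.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def loo_entropy(sample, h):
    """Leave-one-out entropy: each point is scored by the density built from the others."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        density = sum(gaussian_kernel((xi - xj) / h)
                      for j, xj in enumerate(sample) if j != i) / ((n - 1) * h)
        total += math.log(max(density, 1e-300))  # guard against log(0) on underflow
    return -total / n

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(100)]   # hypothetical ensemble
bandwidths = [0.05 * k for k in range(1, 41)]            # candidate bandwidths 0.05 ... 2.0
scores = [loo_entropy(sample, h) for h in bandwidths]
best_h = bandwidths[scores.index(min(scores))]           # lowest entropy wins
```

Unlike the resubstitution score, the leave-one-out score penalizes very small bandwidths (isolated points receive almost no density from their neighbours), so the minimum falls at an intermediate bandwidth.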
The corresponding kernels used for the approximation are shown in Figure 1d.

2.2 Quality of the estimates

In the previous section, two qualitatively different but interrelated estimates were obtained: density estimates and entropy estimates. Our interest in this section is to study the quality of both of them. In Section 2.2.1, the discrepancy between the density estimate and the true density is analyzed. Since our goal is to estimate a density, we find it particularly important to get a feeling for the kind of errors we should expect. In Section 2.2.2 the error in the score (entropy) is studied. Since this score is used to compare and choose models, the error in the score will clearly affect the chances of selecting a good model.

Figure 1: The density estimation process. a) The true density and the sample. b) The entropy landscape. c) Three estimates corresponding to three different bandwidths: the red, green and blue are the best, over-smoothed and undersmoothed estimators respectively. d) The kernels used for each of the estimates in c).

2.2.1 The density estimate

An estimate for the density has been obtained in the previous section. But how is it related to the true density? Informally, we could say that the estimate will be a smoothed version of the true density, plus random noise [16]. This can be seen by considering the expectation of the estimate at a single point in the scalar case, which is the convolution of the true density with the scaled kernel, $E[\hat f_h(x, S)] = \int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y)\,dy$. By definition, this is not an unbiased estimator of the density (footnote 8). The bias and variance of the density estimate at a point (these are pointwise measures) can be easily computed [16]:

$$\mathrm{bias}_h(x) = \frac{h^{2}\sigma_K^{2}}{2}\, f''(x) + \text{higher-order terms in } h, \qquad \sigma_K^{2} = \int t^{2} K(t)\,dt,$$

$$\mathrm{var}_h(x) = \mathrm{var}\!\left(\hat f_h(x, S)\right) \approx \frac{f(x)}{N h}\int K(t)^{2}\,dt.$$

Let us illustrate these concepts by continuing with the example in Section 2.1.4. To see the variability inherent to the process of drawing ensembles, we repeated 30 times the process carried out in the example above. Each time we drew an ensemble $S_i$, estimated the best density and plotted the result in Figure 2a. For reference, the true density was plotted in black as before. Notice that each estimate is quite different from the others, but all surround the experimental average (or expected) density, which is plotted in red. As was mentioned before, this density is a smoothed version of the true density. Figure 2b shows a band of two standard deviations around the average density.

Footnote 8. In theory, the estimation could be unbiased when $h \to 0$ and the kernel tends to the delta function, or when $f(x)$ has bounded frequency content and the kernel is an ideal low pass filter in the support of $F(w)$, the Fourier transform of $f(x)$.
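The repeated-draws experiment described above can be imitated with a small simulation. The sketch below is only an illustration: it assumes a standard normal as the "true" density, a Gaussian kernel, fixed bandwidths instead of cross-validated ones, and hypothetical function names. It estimates the pointwise bias and variance of the kernel estimate at a single point for a narrow and a wide bandwidth.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density; used both as kernel and as the assumed true density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """One-dimensional kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def pointwise_bias_and_variance(x, h, n_points=100, n_repeats=200, seed=0):
    """Draw many ensembles, estimate the density at x each time, summarize the spread."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        estimates.append(kde(x, sample, h))
    mean = sum(estimates) / n_repeats
    variance = sum((e - mean) ** 2 for e in estimates) / (n_repeats - 1)
    return mean - gaussian_kernel(x), variance   # (bias, variance) at x

bias_narrow, var_narrow = pointwise_bias_and_variance(0.0, h=0.1)
bias_wide, var_wide = pointwise_bias_and_variance(0.0, h=1.0)
```

With these settings the wide bandwidth flattens the peak at the origin (larger bias in absolute value) while the narrow bandwidth gives noisier estimates (larger variance), which is the trade-off that the bandwidth selection has to balance.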

The important point to remember from this section is that the (pointwise) bias increases as the bandwidth increases, while the opposite is true for the variance. Cross-validation maximum likelihood is the judge that finds the compromise between the two, keeping the bias as low as possible without increasing the variance too much. To avoid excessive variance, some bias must be introduced in the form of smoothing. This smoothing produces greater deterioration in the fast rising and decaying parts of the density, since those are the ones containing higher frequency components (Figure 2b).

2.2.2 The entropy estimate

As we saw in Section 2.1.1, the sample $S$ has a corresponding entropy value for each model in the hypothesis space. This value can be used, through the maximum likelihood principle, to find the best model in the space. As expected from a score used for comparing models, it is a measure of the global fit of each model (recall the discussion about relative entropy at the end of Section 2.1.1). Since we only have limited information about the process (contained in the finite sample available), we do not expect this score to be infallible in discriminating between models. It is only an estimate of the goodness of fit between the model and the process. And as an estimate, it is perturbed by noise.

To illustrate this we return to the example. To create Figure 2a, multiple ensembles were drawn from the process and corresponding density estimates were computed. Together with each density estimate came a score (the lowest entropy) estimate. These are plotted as blue circles in Figure 3. Using the fact that the distribution of the entropy estimate is asymptotically normal [20], a Gaussian was fitted to them. For comparison, the true entropy value of the true density was also plotted as a black vertical line. Although all the samples were drawn from the same process, the computed entropies differ from the true entropy. Moreover, they do not even surround the true entropy. This bias has its origin in the loss of score (entropy) due to the smoothing applied, and it could be compensated (in theory) using techniques similar to those that we will later develop to assess the variance. But should it be compensated?
We find that it should not, since it corresponds to a real deterioration of the density estimate that should be taken into account when comparing models. Moreover, compensating the score only affects the result of the comparison but does not improve the density estimate. We also note a curiosity in Figure 3: there are some density estimates that explain the data better than the true density. This is because these estimates are only trying to explain the points in the observed ensemble. The red line in the figure corresponds to the entropy of the average density.

Figure 2: From each sample a different estimate of the density is obtained. a) Multiple density estimates (in blue) from multiple samples $S_i$. The average estimate is shown in red. b) A band (in cyan) of two standard deviations around the pointwise average of the estimates (in red).

Figure 3: Each density estimate has a corresponding entropy estimate. Density of the entropy estimates in blue, true entropy in black and entropy of the average density in red.

How, then, can a judgment between models be attributed to an intrinsic difference between the models and not to random effects produced by the noise? We first need an estimate for the variance of the entropy estimate. Joe [21] derived explicit expressions for the bias and variance of this estimate. Unfortunately the true density is required in the computation, making it of little practical use for us, except for providing an idea of the order of convergence of the errors in $N$ (the sample size), $h$ (the bandwidth) and $d$ (the dimension of the domain). A practical alternative is explored in the next section.

2.2.3 The bootstrap estimate

The bootstrap [22] is a useful tool to compute error measures for density estimate functionals. The idea is similar to what was done in Figure 3. In order to compute the variability (or even the distribution) of the estimate, we take many samples from the process and compute the function of interest for each one (Figure 4a). The problem with this approach is that usually we only have one sample $S$ from the process to be studied, and it is not possible to get more. The solution proposed by the bootstrap technique is to generate (directly or indirectly) new samples from the original sample $S$, and use these to assess the variability of the functional (Figures 4b and 4c).

The classic bootstrap method works as follows (see Figure 4b):

1. A sample of size #S is drawn, with replacement, from the original sample $S$.
2. This new sample is used to compute the desired functional (in our case, the sample is first used to estimate a density that is then plugged into the entropy functional).
3. Steps 1 and 2 are repeated to obtain many estimates, and in turn an estimate for the variance of the desired functional (the entropy in our case).

As just presented, this method cannot be used in our case, since repeated sample points will force the cross-validation scheme to choose the solution with null bandwidth and deltas at the data points (as explained in Section 2.1.3).
Instea, we use a varaton of te classcal bootstrap tecnque, known as smoote bootstrap [, 3]. In ts approac, we raw te new sample from te estmate of te ensty (tat we ave to fn anyway), an not from te orgnal sample tself (see Fgure 4c):. A ensty estmate s constructe from te sample, as explane n Secton... A new sample s rawn from ts ensty an use to compute te esre functonal; te form of te ensty estmate makes rawng samples from t partcularly easy. 3. As n te classc bootstrap, step s repeate to obtan many estmates tat wll be use to

13 a) True Densty (known) f (x) Samples (as many as esre) S S D Densty Estmates fˆ( x) fˆ D ( x) Hˆ ˆ... H D Entropy Estmates b) True Densty (unknown) f (x) Sample (only one) S Bootstrap resamples S ˆ ˆ S B Bootstrap Densty Estmates ˆ * f ( x) ˆ * f B ( x) H ˆ * ˆ *... H B c) True Densty (unknown) f (x) Sample (only one) S Densty Estmate (only one) ˆ f SB ( x) Smoot Bootstrap resamples S ˆ ˆ S B Bootstrap Densty Estmates ˆ * f ( x) ˆ * f B ( x) Ĥ H ˆ * ˆ *... H B Fgure 4: Te Bootstrap an te Smoot Bootstrap. a) Stuaton n te example of Fgure 3. Snce te true ensty s known, t s possble to raw multple samples { S,..., S n }, an from eac one of tem estmate a new ensty an corresponng entropy. b) Classc bootstrap approac: only one sample S s avalable an te new samples are generate from t. c) Smoote bootstrap tecnque: a ensty estmate ˆ ( x) s frst constructe an ten use to raw new samples. f SB calculate te varance of te esre functonal s estmate. Step s performe only once an te obtane ensty s use many tmes n step to raw furter samples. Te ratonale ben te ea of te bootstrap (asses te varablty of Ĥ from * * tat of H ˆ ˆ... H B ) s supporte by te fact tat ( Hˆ H ) an ( Hˆ Hˆ ) * ave te same lmt strbuton [4] (see Fgure 4 for te exact efnton of tese varables). Let us use te example starte n Secton..4 to exemplfy tese concepts. Takng avantage of te fact tat te true ensty functon s known (n contrast to wat usually appens n a real scenaro), we can procee as n Fgure a to get many ensty estmates from corresponng ensembles rawn from te process. Ten, eac estmate (e.g., te -t) s use n two ways (see Fgure 5). Frst, to compute an estmate for te entropy ( Ĥ ) an secon to compute a collecton of ensembles ( S... B S ) tat n turn s use to prouce new ensty

($\hat{f}_i^1(x), \ldots, \hat{f}_i^B(x)$) and corresponding entropy ($\hat{H}_i^1, \ldots, \hat{H}_i^B$) estimates. The idea is to examine the viability of using the variability of $\hat{H}_i^1, \ldots, \hat{H}_i^B$ to estimate the variability of $\hat{H}_i$. In general we will only have one estimate of the score ($\hat{H}$ in Figure 4c) instead of the many estimates ($\hat{H}_i$ in Figure 5) artificially generated in this case (which was possible because we know the true density). Thus, we will have to resort to bootstrap resamples for an estimate of the variability of the score.

Figure 6a shows the densities for the different scores appearing in Figure 5 for our toy example. The true entropy, which is the entropy computed from the true density, is shown in black. The density of the entropy estimates $\hat{H}_i$, computed from the density estimates $\hat{f}_i(x)$, is shown in blue. The densities of the entropy estimates $\hat{H}_i^j$, computed from the bootstrap density estimates $\hat{f}_i^j(x)$, are shown in red. Figure 6b compares the estimate for the variance of the score, $\sigma^2(\hat{H}_i^1, \ldots, \hat{H}_i^B)$, to its real value, $\sigma^2(\hat{H}_1, \ldots, \hat{H}_D)$. This estimate for the variance produced values above the true value but of the expected order.

Setting aside the problem produced by the repeated points in the classic bootstrap resample (which might lead to estimating zero-bandwidth kernels), it is not clear whether using the smoothed bootstrap will improve the performance of the estimator over the classic bootstrap in this case, and if it does, how the smoothing bandwidth ($h_{SB}$ in Figure 4c) should be found [5, 6]. It is known that the density used to create the resamples in the bootstrap world should be based on a larger bandwidth $h_{SB}$ than the original one (obtained using cross-validation maximum likelihood) [4], but there are no specific rules to choose it. Under these conditions, we decided to use the original bandwidth, aware that this matter requires further research.

In Section 2.3.3 below we will use the smoothed bootstrap to find an estimate of the error in the scores of the models to be compared, and we will take these errors into account to perform the comparison.
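The smoothed bootstrap steps above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the authors' implementation: it assumes a Gaussian kernel, a plain resubstitution entropy estimate (minus the mean log-density at the sample points), and a rule-of-thumb bandwidth instead of the cross-validated maximum likelihood bandwidth used in the paper. The names `entropy_estimate` and `smoothed_bootstrap` are hypothetical.

```python
import numpy as np

def entropy_estimate(sample, h):
    """Resubstitution entropy estimate: minus the mean log of a Gaussian KDE."""
    diff = (sample[:, None] - sample[None, :]) / h
    kernels = np.exp(-0.5 * diff**2) / (h * np.sqrt(2.0 * np.pi))
    density = kernels.mean(axis=1)  # KDE evaluated at every sample point
    return float(-np.mean(np.log(density)))

def smoothed_bootstrap(sample, h, n_boot=200, seed=0):
    """Entropy estimates computed on resamples drawn from the KDE itself."""
    rng = np.random.default_rng(seed)
    n = sample.size
    scores = np.empty(n_boot)
    for b in range(n_boot):
        # Drawing from a Gaussian KDE: pick a data point, then add N(0, h^2) noise.
        centers = rng.choice(sample, size=n, replace=True)
        resample = centers + h * rng.standard_normal(n)
        scores[b] = entropy_estimate(resample, h)
    return scores

# Toy data (assumption): 300 draws from a standard normal density.
rng = np.random.default_rng(1)
sample = rng.standard_normal(300)
h = 1.06 * sample.std(ddof=1) * sample.size ** (-0.2)  # rule-of-thumb bandwidth
h_hat = entropy_estimate(sample, h)
boot_scores = smoothed_bootstrap(sample, h)
variance_estimate = boot_scores.var(ddof=1)  # bootstrap variance of the score
```

Because the resamples come from the smoothed density, they contain no repeated points, which is the practical advantage over the classic bootstrap noted in the text.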
Figure 5: Simulating the Bootstrap. Relationship between samples and estimates used in Figure 6. See text for details.

2.3 Curse of dimensionality

If we had a sample containing a very large number of observations (relative to the dimension

of the space), the problem would be solved by following the framework described above. Unfortunately, this is not usually the case. Even for very small proteins, the number of degrees of freedom is in the tens or hundreds, and therefore the number of structures needed to decently estimate the density is prohibitively large. This problem is very well known in statistics and has been dubbed the curse of dimensionality. In this section we suggest a possible solution, which is applicable when the data has a certain property that allows us to make, for a given error, the sample size virtually independent of the number of residues (degrees of freedom) in the protein. This is one of the main contributions of this article.

Figure 6: Densities for the different scores defined in Figure 5 for the toy example. a) True entropy in black, density of the estimate in blue, and estimates of the density estimate in red. b) True variance in blue and density of the variance estimate in red.

So far, any reference to the particular source of the information was avoided; what was described in previous sections is valid for any dataset. At this point, however, significant simplifications can be obtained by exploiting an intrinsic characteristic of the ensembles, for example, of folded protein conformations. In the following, we restrict our attention to this kind of dataset, although the same applies to any dataset with the property we introduce in the next paragraph.

An ensemble of folded protein conformations, characterized for example by the backbone torsion angles, has the desirable property that each angle is mainly related to a small subset of the other angles and, more importantly, is virtually independent of the rest. What happens to one angle of the conformation is mainly affected by the previous and following angles along the chain and perhaps by those angles in its spatial proximity. This is so because a relation between angles that are far apart would be hard to explain without the inclusion of an angle lying in between (the influence has to be transmitted in some way), and because the number of residues surrounding a given residue is bounded (due to packing considerations). We now formalize these concepts.

2.3.1 Divide and conquer

We set out to estimate the density $p(x_1, x_2, \ldots, x_d)$ of a $d$-dimensional random variable (recall from the beginning of Section 2 that $d = 2(M-1)$, where $M$ is the number of residues in the protein). (Footnote 9: In previous sections the density to be estimated was called $f(\cdot)$, in agreement with the literature on density estimation. At this point, however, where properties of probability densities are being invoked, the use of $p(\cdot)$ seems more natural.) This density can be written as the product of two independent factors,

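$$p(x_1, x_2, \ldots, x_d) = p(x_{i_1} \mid x_{i_2}, \ldots, x_{i_d}) \cdot p(x_{i_2}, x_{i_3}, \ldots, x_{i_d}).$$

The $i$-subscripts on the right-hand side were introduced to account for the fact that any coordinate (arbitrarily chosen) can be factored out independently of its position in the original vector. Applying this step inductively, the density can be written as the product of $d$ independent factors:

$$p(x_1, x_2, \ldots, x_d) = p(x_{i_1} \mid x_{i_2}, \ldots, x_{i_d}) \cdot p(x_{i_2} \mid x_{i_3}, \ldots, x_{i_d}) \cdots p(x_{i_{d-1}} \mid x_{i_d}) \cdot p(x_{i_d}) \qquad (6)$$

By construction, and in contrast to Hinton's PoE [0], the factors (experts) in (6) are independent. Since, as stated above, only a few torsion angles (strongly) influence a specific angle, the rest can be discarded from the conditioning set for that particular angle. Assume that we know that any angle is (strongly) influenced by no more than $n$ others (with $n$ much smaller than $d$) (footnote 10); then

$$p(x_1, x_2, \ldots, x_d) \approx p(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_1^n}) \cdot p(x_{i_2} \mid x_{i_2^1}, \ldots, x_{i_2^n}) \cdots p(x_{i_{d-1}} \mid x_{i_d}) \cdot p(x_{i_d}).$$

The original problem of estimating a $d$-dimensional density was reduced to that of estimating $d$ independent densities in dimension $(n+1)$ or lower. Using properties of the protein conformations, we have thus significantly reduced the problem dimensionality (assuming of course that $n$ is significantly lower than $d$). For our purposes, it is more practical, and equally valid, to stop the factorization earlier than before and consider the following factorization into factors of (almost) uniform dimension:

$$p(x_1, x_2, \ldots, x_d) \approx p(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_1^n}) \cdot p(x_{i_2} \mid x_{i_2^1}, \ldots, x_{i_2^n}) \cdots p(x_{i_{d-n}} \mid x_{i_{d-n}^1}, \ldots, x_{i_{d-n}^n}) \cdot p(x_{i_{d-n+1}}, \ldots, x_{i_d}) \qquad (7)$$

Would it have been better to factor out more than one variable at a time? That is, could the following be a better factorization than (7)?

$$p(x_1, x_2, \ldots, x_d) \approx p(x_{i_1}, x_{i_2} \mid x_{i_1^1}, \ldots, x_{i_1^n}) \cdots p(x_{i_{d-n+1}}, \ldots, x_{i_d}) \qquad (8)$$

It is easy to see that the answer is no, and thereby (7) is optimal in this sense. The entropy corresponding to the density in (7) is (see Section .. and [5] for the basic properties of the entropy used):

$$H(x_1, x_2, \ldots, x_d) = H(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_1^n}) + H(x_{i_2} \mid x_{i_2^1}, \ldots, x_{i_2^n}) + \cdots + H(x_{i_{d-n}} \mid x_{i_{d-n}^1}, \ldots, x_{i_{d-n}^n}) + H(x_{i_{d-n+1}}, \ldots, x_{i_d}) \qquad (9)$$

From the considerations made in Section .., it follows that this is the expression that has to be minimized. In Section . we showed how to compute each of the summands in (9) (footnote 11).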
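The effect of truncating the conditioning sets can be checked numerically in a case where every entropy has a closed form. The sketch below is an illustration only: it assumes a Gaussian density whose covariance follows a first-order chain ($\rho^{|i-j|}$), and it conditions each variable on the $n$ variables that immediately follow it in the ordering rather than on an optimized set. The function names are hypothetical.

```python
import numpy as np

def gaussian_entropy(cov, idx):
    """Differential entropy (nats) of the Gaussian marginal over coordinates idx."""
    idx = list(idx)
    if not idx:
        return 0.0
    _, logdet = np.linalg.slogdet(cov[np.ix_(idx, idx)])
    return 0.5 * (len(idx) * np.log(2.0 * np.pi * np.e) + logdet)

def conditional_entropy(cov, target, given):
    """H(x_target | x_given) = H(x_target, x_given) - H(x_given)."""
    return gaussian_entropy(cov, [target, *given]) - gaussian_entropy(cov, given)

def factorized_entropy(cov, order, n):
    """Entropy of a factorization shaped like Equation (7).

    Simplifying assumption: each variable is conditioned on the n variables
    that immediately follow it in `order`; the last n are kept as one joint factor.
    """
    d = len(order)
    total = gaussian_entropy(cov, order[d - n:])
    for j in range(d - n):
        total += conditional_entropy(cov, order[j], order[j + 1:j + 1 + n])
    return total

# Chain-like dependence: each variable interacts directly only with its neighbours.
d, rho = 5, 0.8
cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
order = list(range(d))
h_joint = gaussian_entropy(cov, order)
h_n0 = factorized_entropy(cov, order, 0)  # all variables treated as independent
h_n1 = factorized_entropy(cov, order, 1)  # one conditioning variable per factor
```

For this chain, one conditioning variable already reproduces the joint entropy, while dropping all conditioning yields a strictly larger value, which is the behaviour the factorization relies on.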
Footnote 10: For simplicity we assume the number of conditioning variables ($n$) to be equal in every factor, but this is not strictly necessary.

Footnote 11: Actually, we only showed how to compute entropies of the form $H(x_1, \ldots, x_n)$. Entropies of the form $H(x_1 \mid x_2, \ldots, x_n)$ can be computed simply using the relationship $H(x_1 \mid x_2, \ldots, x_n) = H(x_1, \ldots, x_n) - H(x_2, \ldots, x_n)$.

In the next sections we explain how to find the new parameters introduced in this section, namely the $i$-indexes (the conditioned and conditioning variables) and the number $n$ of conditioning variables. This search requires accessing the values of the entropy summands many times; for that reason it makes more sense to compute these summands in advance, store them in a database, and only then start the search. It can fairly be objected that, since the number of sets of $n$ variables grows very fast as $n$ increases, this database can become impractically big. Our experiments show that, at least for the tested datasets, the interesting subsets of $n$ variables can be found incrementally by adding variables to the interesting subsets of $(n-1)$ variables. Figure 7a presents an example of this fact, extracted from the dataset used in Section 3.3. This fact has been extensively observed in this

dataset and others, and is documented in Figure 7b.

Figure 7. Progressive discovery. a) One example: the best set of two conditioning variables for variable x is a superset of the best set of one conditioning variable. In this graph the red stars show the entropy value for the corresponding sets (a selection of them is also named in red). The blue lines connect each conditioning set with its best subset having one less variable. b) 2D histogram of points of the form $(I_c, I_p)$ corresponding to the blue lines in Figure 9, where $I_c$ is the index (normalized to be between 0 and 1) of the entropy of the child set in the sorted set of entropies of children and $I_p$ is the index of the entropy of the parent set in the sorted set of entropies of parents. For example, the line connecting H( 37 38) and H( 37) in a) corresponds to one observation of coordinates (0,0) in this graph, since both sets have the lowest entropy in their respective levels.

2.3.2 Discovering the dependences

Let us first fix $n$, and find the product of densities of the form in Equation (7) that best explains the data (has the smaller entropy) for that $n$. Let us first introduce some notation to simplify the exposition. Let $I_{1,d}$ be a permutation of the integers from 1 to $d$ (e.g., $I_{1,5} = (3,5,1,4,2)$), $I_{j,d}$ be the indices of that permutation from position $j$ to $d$ (e.g., $I_{3,5} = (1,4,2)$), and $I_j$ the $j$-th index of that permutation (e.g., $I_1 = 3$ for the example above). Also let $C_j^n = \{x_{i_j^1}, \ldots, x_{i_j^n}\}$ be the set of $n$ coordinates conditioning the coordinate $x_{i_j}$ in equations

(7) and (9). What is left to do then is to find the $i$-indexes subject to the conditions:
1. $(i_1, \ldots, i_d) = I_{1,d}$: the $i$-indexes must define a permutation of the coordinates (since every coordinate is eventually factored out, but the order is arbitrary); and
2. $C_j^n \subset I_{j+1,d}$: the set of conditioning variables for $i_j$ is a subset of the variables that follow $i_j$ in the permutation (since only those variables not yet factored out can be used to condition). As stated at the beginning of this section, it must contain $n$ variables.

Once the permutation $I_{1,d}$ is found (we will explain how we do this below), and assuming that all the terms of the form $H(x_{i_j} \mid C_j^n)$ have been previously computed and stored in a database, the conditioning coordinates $C_j^n$ for each conditioned variable $i_j$ can be straightforwardly discovered with the following simple rule: just choose from the set $I_{j+1,d}$ the $n$ indexes $\{i_j^1, \ldots, i_j^n\}$ that minimize $H(x_{i_j} \mid C_j^n)$. It can be shown that for a given permutation, no other conditioning set can do better. Having established this, the notation can be extended further to rewrite Equation (9) as

$$H^n(I_{1,d}) = H(I_1 \mid C_1^n) + H(I_2 \mid C_2^n) + \cdots + H(I_{d-n} \mid C_{d-n}^n) + H(I_{d-n+1,d}) \qquad (10)$$

where the $C_j^n$ are calculated as just explained. This notation stresses the fact that only the permutation has to be found; the other sets follow using the simple rule explained above. Incidentally, we see that $H^n(I_{1,d})$ is not affected by permutations of the last $n$ positions of $I_{1,d}$.

To select a good permutation $I_{1,d}$, a simple genetic algorithm based on the ideas of mutation and selection was used. It basically works as follows:
1. Select a random initial permutation $I_{1,d}$ (e.g., $I_{1,5} = (3,5,1,4,2)$, assuming $d = 5$ for the sake of the example).
2. Evaluate its entropy, $H = H(I_{1,d})$, using the pre-stored values.
3. Choose two positions at random (e.g., 2 and 4), and swap those coordinates (to obtain $I'_{1,5} = (3,4,1,5,2)$).
4. Evaluate the entropy for the new permutation, $H' = H(I'_{1,d})$.
5. If the new permutation is better than the previous one ($H' < H$), keep it ($I_{1,d} \leftarrow I'_{1,d}$).
6. While the entropy $H$ is decreasing and the maximum number of iterations has not been reached, return to step 3.

Two modifications to this algorithm were found to improve its performance. First, only adjacent positions were swapped in Step 3, allowing for a more efficient computation of $H'$ in Step 4; and second, the swaps are accepted or rejected with a certain probability that depends exponentially on $\Delta H = H' - H$.

By its nature, this algorithm is prone to find local minima. To avoid or reduce this problem the algorithm is run many times for each value of $n$, and in each case the best model is kept for that particular $n$.

Intuitively, for small $n$'s, each factor in Equation (7) will be well estimated (since the density being estimated is low dimensional), but the dependences between variables might be lost. Conversely, for large $n$'s, the dependences will be captured, but the quality of each factor will deteriorate. Clearly a compromise must be made. This is detailed next.
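The permutation search described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the entropies come from a Gaussian closed form instead of a pre-computed database of kernel estimates, the conditioning sets are found by exhaustive search over the variables that follow each conditioned variable, and the acceptance rule `exp(-delta / temperature)` is only one possible reading of "a probability that depends exponentially on ΔH". All function names and the `temperature` parameter are hypothetical.

```python
import itertools
import math
import random
import numpy as np

def joint_entropy(cov, idx):
    """Closed-form Gaussian entropy over coordinates idx (stand-in for the database)."""
    idx = list(idx)
    if not idx:
        return 0.0
    _, logdet = np.linalg.slogdet(cov[np.ix_(idx, idx)])
    return 0.5 * (len(idx) * math.log(2.0 * math.pi * math.e) + logdet)

def best_conditional(cov, target, candidates, n):
    """Smallest H(x_target | C) over all size-n subsets C of the candidates."""
    return min(
        joint_entropy(cov, (target, *subset)) - joint_entropy(cov, subset)
        for subset in itertools.combinations(candidates, n)
    )

def model_entropy(cov, perm, n):
    """Score in the shape of Equation (10) for a given permutation."""
    d = len(perm)
    total = joint_entropy(cov, perm[d - n:])
    for j in range(d - n):
        total += best_conditional(cov, perm[j], perm[j + 1:], n)
    return total

def search_permutation(cov, n, iters=300, temperature=0.0, seed=0):
    """Mutation and selection over permutations, swapping adjacent positions."""
    rng = random.Random(seed)
    d = cov.shape[0]
    perm = list(range(d))
    rng.shuffle(perm)
    h = model_entropy(cov, perm, n)
    best_perm, best_h = perm[:], h
    for _ in range(iters):
        k = rng.randrange(d - 1)
        cand = perm[:]
        cand[k], cand[k + 1] = cand[k + 1], cand[k]
        h_new = model_entropy(cov, cand, n)
        delta = h_new - h
        # Always accept improvements; optionally accept worse moves (assumed rule).
        if delta < 0 or (temperature > 0 and rng.random() < math.exp(-delta / temperature)):
            perm, h = cand, h_new
            if h < best_h:
                best_perm, best_h = perm[:], h
    return best_perm, best_h

d, rho = 6, 0.7
cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
best_perm, best_h = search_permutation(cov, n=1)
```

In practice the search would be restarted from several random permutations, as the text recommends, keeping the lowest score found for each `n`.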

2.3.3 Selecting the order n

In Section ..3 we explained how to use the smoothed bootstrap to compute the variance of entropy estimates like the summands $H(I_j \mid C_j^n)$ in Equation (10). We first extend this result to compute the variance of the entropy of the permutations, $H^n(I_{1,d})$. The summands on the right-hand side of Equation (10) are independent of each other, and hence the variance of the sum is the sum of the variances of the summands:

$$\sigma^2_{H^n(I_{1,d})} = \sum_{j=1}^{d-n} \sigma^2_{H(I_j \mid C_j^n)} + \sigma^2_{H(I_{d-n+1,d})}.$$

Also, recall that each summand in Equation (10) is approximately normally distributed, and thus so is their sum.

At this point of the algorithm, several models have been found, one for each $n$. Each model has a corresponding score (entropy) as determined by the maximum likelihood principle (see Section ..), and each score has a corresponding variance, as computed using the smoothed bootstrap (see Section ..3). To sum up, the results obtained so far look like this:

n | Model | Entropy | Variance
0 | $p(x_1) \cdot p(x_2) \cdots p(x_d)$ | $H^0(I_{1,d})$ | $\sigma^2_{H^0(I_{1,d})}$
1 | $p(x_{i_1} \mid x_{i_1^1}) \cdot p(x_{i_2} \mid x_{i_2^1}) \cdots p(x_{i_d})$ | $H^1(I_{1,d})$ | $\sigma^2_{H^1(I_{1,d})}$
... | ... | ... | ...
m | $p(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_1^m}) \cdot p(x_{i_2} \mid x_{i_2^1}, \ldots, x_{i_2^m}) \cdots p(x_{i_{d-m+1}}, \ldots, x_{i_d})$ | $H^m(I_{1,d})$ | $\sigma^2_{H^m(I_{1,d})}$

Which one of the m models is best? According to the maximum likelihood principle, the one that has the lowest entropy. But we assigned to the models, via the smoothed bootstrap, not just a score but a probability density of scores. We therefore have to define a criterion to choose between those models.

Assume we have two models A and B with $n_A$ and $n_B$ conditioning variables respectively, $n_A > n_B$. Assume also that the density of the entropy estimate for each model is as shown in Figure 8. Which model is better? Intuitively we should choose model A, since the probability that it performs better than model B (again according to maximum likelihood) is higher than the probability of the reverse case. Obviously, if the opposite were true, we should choose model B. If neither of the possibilities holds, it makes sense to choose the computationally less expensive option, meaning the model with the smallest number of conditioning variables.
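The comparison between two models can be sketched numerically once each score is summarized by a mean and a variance. The sketch below assumes that the two entropy estimates are independent Gaussians (independence between models is an assumption made here for simplicity) and uses the threshold $k/(1+k)$ of the cost-weighted rule that the text formalizes immediately afterwards. It shows both a closed-form evaluation and the Monte Carlo alternative; the function names are hypothetical.

```python
import math
import numpy as np

def prob_a_better(mean_a, var_a, mean_b, var_b):
    """P(H_A < H_B) for independent Gaussian scores, via the normal CDF."""
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_a_better_mc(mean_a, var_a, mean_b, var_b, draws=200_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = np.random.default_rng(seed)
    h_a = rng.normal(mean_a, math.sqrt(var_a), draws)
    h_b = rng.normal(mean_b, math.sqrt(var_b), draws)
    return float(np.mean(h_a < h_b))

def choose_model(mean_a, var_a, mean_b, var_b, k=1.0):
    """Pick the costlier model A only if P(H_A < H_B) exceeds k / (1 + k)."""
    p = prob_a_better(mean_a, var_a, mean_b, var_b)
    return "A" if p > k / (1.0 + k) else "B"

p_exact = prob_a_better(1.0, 0.01, 1.3, 0.01)
p_mc = prob_a_better_mc(1.0, 0.01, 1.3, 0.01)
```

With `k = 1` the rule reduces to choosing A when it is more likely than not to have the lower entropy; larger values of `k` demand stronger evidence before paying for the extra conditioning variables.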
Figure 8: Choosing the model's order (n). Density of the entropy estimates for two hypothetical models A and B. Which one should be chosen?

Formalizing this, we end up with a selection rule (for $n_A > n_B$):

$$P(H_A < H_B) > k \, P(H_A \ge H_B) \iff P(H_A < H_B) > \frac{k}{1+k},$$

where $k$ is a parameter, introduced to account for the fact that model A is computationally more expensive, that should be conveniently adjusted. Knowing that the densities of the entropy estimates are (approximately) Gaussian, with mean and variance given by the last two columns of the previous table respectively, the probability can be computed (for instance using Monte Carlo) and the condition evaluated for the models, resulting in a unique model being selected as the best representative for the data. This model may not be sufficient, but it is the best that can be obtained with the available data according to our optimality criterion. This is an important concept: our proposed framework finds the best possible model (with respect to the selected optimality criterion); if this model is not good enough, more data will have to be collected, but the available data was used efficiently. Figure 9 summarizes the procedure explained in this section. This concludes the derivation of the computational framework. We proceed now to its validation and application to real data.

3 Results and applications

In order to test our method, we start with a set of artificially constructed examples. In Section 3.1 we present these results. In Section 3.2 we apply the method to two real datasets. First, an ensemble of conformations of the 12-residue-long β-hairpin tryptophan zipper [7] is analyzed (Section 3.2.1). Then, in Section 3.2.2, the villin headpiece [8], a significantly more complex peptide having 36 residues, is studied.

3.1 Validation via artificial examples

Four artificial datasets with different dependency levels were constructed to validate the proposed method. They are schematically represented on the left of Figure 10. In all these datasets the sample size (N) was 500, and the dimension of the data points (d) was 6. The method explained in Section 2 was applied to each dataset. In all cases the dependences were correctly discovered. The right side of Figure 10 shows the evolution of the entropy as more conditioning variables are allowed in the model for each dataset. In each case, the correct number of conditioning variables (n) was found. It is worth noting that, contrary to what is expected for an infinite dataset, adding unnecessary conditioning variables deteriorates the model.
Figure 9: The complete procedure explained in Section 2.3. The blue numbers in the corner of each box correspond to the sections in which the particular computational box is explained. Boxes: Entropy Database; Search Factors for Best Products (ML + GA); Compute Factors Entropy Variance (SB); Select Best Model. ML: Maximum Likelihood; GA: Genetic Algorithm; SB: Smoothed Bootstrap.

For finite sample sizes, the price to pay for considering more dependences is more smoothing, which in turn deteriorates the model. As explained before, a compromise must be made, and this is done here automatically, by selecting the optimum $n$ using the rule described in Section 2.3.3 and the


information in the graphs of Figure 10.

Figure 10: Validation with four artificially generated datasets. Left: the relationship between the variables is schematically represented. Middle: a formal description of the relationships. Right: evolution of the entropy as more conditioning variables are included. The cyan band represents the 99% confidence interval.

3.2 Protein examples

3.2.1 β-hairpin tryptophan zipper

Our first real ensemble consists of 48 conformations of the β-hairpin tryptophan zipper [3, 6, 7], a peptide having 12 residues. The backbone of these peptides can be described by 22 torsion angles (φ's and ψ's), and consequently we need to estimate a 22-dimensional probability distribution function (pdf). In Figure 12a, the evolution (with respect to $n$, the number of conditioning variables) of the entropy is displayed. Note that including more than two conditioning variables in the density factors does not significantly improve the estimate (compare to three), and can even deteriorate it (compare to four). As explained in Section .., the magnitude of the improvement or deterioration should be judged relative to the variance corresponding to the score of each model (represented by the blue band in Figure 12a). Applying the algorithm of Section 2.3.3, we select the model with two conditioning variables as the one that best represents the ensemble. We do not claim that the true dimensionality of the process is two, but only that, for the currently available sample, the benefit obtained in computing a pdf using the additional dependences accounted for when including

more than two conditioning variables is smaller than the harm done by the additional smoothing required.

It is interesting to observe the dependences between variables found by the algorithm (Figure 11). Notice that, as expected, the angles are often conditioned on the adjacent (along the chain) angles of the same kind (φ or ψ), or on the corresponding angle of the opposite kind. This is further evidence that the algorithm is doing what it should. An interesting fact to note is that we find more frequent conditioning on the ψ's; this can be explained by the asymmetric roles of φ and ψ in the Ramachandran map. It is appealing that this effect emerges from the combination of the formalism and the simulation data, rather than being something which must be included by hand.

Figure 11: Dependences discovered between the variables (axes: conditioned variable, conditioning variable). Dependency diagram when two (in blue) and three (in red) conditioning variables are allowed. The conditioning variables for a given variable appear as dots in the corresponding row. For instance, when two conditioning variables are allowed for $\varphi_2$ (2nd row in the topmost squares), $\varphi_3$ (3rd column in the leftmost squares) and $\psi_2$ (2nd column in the rightmost squares) are selected. If one more variable is allowed, $\varphi_6$ is also included.

We can gain insight into the structure of the ensemble by explicitly computing the density value at the available observations using the model selected by our framework. Figure 12b presents a plot of these values sorted from the least likely to the most. It can be seen in this figure that a few conformations are much more likely than the rest (note that the log-density is plotted) and a few conformations are much more unlikely than the rest. The experimentally determined structures (PDB [9] entry 1LE0), also shown in the figure, can be seen to have similar probability densities located between these two extremes.

Are those conformations similar to each other, or more precisely, are there many modes in the density or just one? To consider this question, which needs explicit pdf estimation as done here, we plotted in Figure 12c the distances between all the conformations. (Footnote 12: The distance was computed using the angles formula $d(x, y) = \sum_{i=1}^{d} \{1 - \cos(x_i - y_i)\}$, where $x$ and $y$ are two different $d$-dimensional conformations, as described by their torsion angles.) Since the conformations are sorted (as explained above), the distances between the most probable conformations lie in the upper right corner. The pattern in this area agrees with a unimodal



density. This suggests a way to choose a representative for the ensemble, if one needs to be selected: simply take the most probable conformation. Zooming in on the top ten percent of the structures, however, shows that these structures cluster around two distinct modes (Figure 12d) that lie relatively close to each other. The three most likely conformations and the three least likely conformations from the available ensemble are shown in Figure 13. As expected, the most likely conformations have more hydrogen bonds and hence are more stable.

Figure 12: Analysis of the estimated density. a) Evolution of the entropy as more conditioning variables are included. As before, the band represents the 99% confidence interval. b) Logarithm of the density for each structure (observation). The structures are sorted from the least probable (on the left) to the most probable (on the right). The experimentally determined structures (PDB entry 1LE0), which were not part of the set used for deriving the pdf, are shown in green. c) Distances between all the structures, sorted as in b. d) Distances between the 10% most probable structures, sorted to show clusters. Use the color bar provided to translate into the corresponding numeric values.

Figure 13: The most likely and unlikely conformations. The three most likely (top) and least likely (bottom) conformations according to the density computed by our proposed framework.

3.2.2 Villin headpiece

The second real ensemble we analyze consists of 543 conformations of the villin headpiece molecule [3, 6], a peptide having 36 residues (70 torsion angles). The same tests described in the previous section were performed on this dataset and similar conclusions can be drawn. Figure 14 shows the corresponding results. In this case, since the sample is more than three times bigger than in the previous example (543 versus 48), the system is able to capture three conditioning variables instead of two (see Figure 14b). As before, few conformations are


much more likely than the rest (Figure 14a) and those are situated in what appears to be the mode of a unimodal density (Figure 14c). However, when this mode is carefully examined, it splits into several modes (Figure 14d).

Figure 14: Analysis of the estimated density for the second dataset. a) Logarithm of the density for each simulated (black line) and experimental (green circles) structure. The structures are sorted from the least probable (on the left) to the most probable (on the right). The density of the average native is indicated by a blue square. b) Dependency of the entropy on the number of conditioning variables. c) Distances between all the structures, sorted as in a. d) Distances between the 10% most probable structures, sorted to show clusters. Use the color bar provided to translate into the numeric values.

Figure 15: Selected conformations from the ensemble. The most probable conformations of the three (most probable) clusters of Figure 14 are in the first three rows, one cluster per row. The two least probable conformations of the whole ensemble and the native appear in the fourth row.

For this molecule, an ensemble of structures determined using NMR techniques can be found in [30], together with its minimized average ("native") (PDB entry 1VII). We can estimate the probability density of these structures and compare it with the probability density obtained

for the other structures in the ensemble. By doing this we found that the experimental ensemble (circles and square in Figure 14a) belongs to the group of most unlikely structures. This might indicate that the ensemble of simulated structures is not yet correctly capturing the whole information about the native state (footnote 13), and also raises the question of how much confidence should be given to a single (experimental) structure. The most and least probable conformations are shown in Figure 15.

To further study this phenomenon of low probability associated with the native experimental structures (extracted from the work of McKnight et al. [30]), we plotted the value of each computed factor (from Equation (7)) in its corresponding spatial location (Figure 16). In the same figure we include a graph of the factors where the native structure appears to differ the most from the simulated structures, resulting in low factor values, and thereby small probabilities. For the sake of visualization, and since according to Figure 14b not much explanatory power is gained when using more than one conditioning variable, we only include one conditioning variable in the graphs of the factors.

The technique presented in this work can be used not only to assess the probability of an existing structure, but also to generate novel structures having (presumably) high probability. To obtain these structures we start by choosing a point in the space of conformations and follow the direction of the gradient of the computed pdf until a local maximum is reached. In Figure 17, left, the change of the probability is shown for three different groups: those that started at the most unlikely conformations of the original ensemble (in red), those that started at the most likely conformations of the original ensemble (in blue), and those that started at conformations of the original ensemble having intermediate probability (in green). It can be seen in the figure that in every case, but especially for the red (most unlikely) structures, there was a marked improvement in the probability of the structures. Why is the probability increasing?
One explanation is that the optimization is assembling popular parts to create the new structures, finding the consensus among the observed structures for each region of the protein. These new conformations, automatically obtained from the computed pdf, can be used for example as novel initial conditions for molecular dynamics or for producing new candidates for high resolution protein design.

It is also of interest to assess the nativeness of the generated structures. As we repeatedly mentioned in this work (see Introduction), using a single structure to characterize the native state may not be the best approach. In this case, the distance from the new structures to the experimental ensemble (Figure 17, center) is of the same order as the internal variability of the experimental native observations themselves (Figure 18). Lacking the necessary observations of the experimental structure to follow the approach introduced in this article, we have to resort instead to the distance (11) below to the native ensemble as a measurement of nativeness (the distance to the ensemble is given by the minimal distance to all the elements in it). These results are plotted in Figure 17, center. Note that the distance for the majority of the most likely (blue and green) structures is improved by the optimization. In Figure 17, right, where we plotted the log-probability of the new structures versus their distance to the native ensemble, it can be seen that there is a tendency for the structures closest to the native ensemble to be the most likely. Thereby, the pdf is finding, from all the molecular dynamics results, the best ones according to this distance. These best ones then lead to new conformations, new samples of the conformation space, via the pdf gradient ascent technique mentioned above (Figure 17, left).

Footnote 13: Possibly due to an inaccurate force field used in the simulation.
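The gradient ascent used to generate new conformations can be illustrated on a much simpler density than the factorized model of the paper. The sketch below assumes a plain kernel density estimate built from a product of von Mises kernels with a fixed concentration `kappa`, and a backtracking step size so that the log-density never decreases. The data, `kappa`, the step size, and all function names are assumptions made for the example.

```python
import numpy as np

def log_density(x, sample, kappa):
    """Log of a KDE with a product of von Mises kernels (x: (d,), sample: (N, d))."""
    expo = kappa * np.cos(x[None, :] - sample).sum(axis=1)
    m = expo.max()
    log_norm = sample.shape[1] * np.log(2.0 * np.pi * np.i0(kappa))
    return float(m + np.log(np.exp(expo - m).mean()) - log_norm)

def grad_log_density(x, sample, kappa):
    """Gradient of log_density with respect to the angles x."""
    diff = x[None, :] - sample
    expo = kappa * np.cos(diff).sum(axis=1)
    w = np.exp(expo - expo.max())
    w /= w.sum()  # responsibility of each observation for the point x
    return -kappa * (w[:, None] * np.sin(diff)).sum(axis=0)

def gradient_ascent(x0, sample, kappa, step=0.1, iters=200, tol=1e-8):
    """Follow the gradient of the log-density until the improvement is negligible."""
    x = np.array(x0, dtype=float)
    f = log_density(x, sample, kappa)
    for _ in range(iters):
        g = grad_log_density(x, sample, kappa)
        t = step
        while t > 1e-12:  # backtracking: shrink the step until the density improves
            x_new = x + t * g
            f_new = log_density(x_new, sample, kappa)
            if f_new > f:
                break
            t *= 0.5
        else:
            break  # no improving step found: treat x as a local maximum
        improvement = f_new - f
        x, f = x_new, f_new
        if improvement < tol:
            break
    return np.mod(x + np.pi, 2.0 * np.pi) - np.pi, f  # wrap angles to [-pi, pi)

# Toy ensemble (assumption): 200 two-angle conformations around (0.5, -1.0) rad.
rng = np.random.default_rng(0)
sample = rng.vonmises(np.array([0.5, -1.0]), 8.0, size=(200, 2))
x0 = np.array([2.5, 2.0])
f_start = log_density(x0, sample, 4.0)
x_star, f_star = gradient_ascent(x0, sample, 4.0)
```

Starting the ascent from an unlikely point yields the largest gain in log-density, which mirrors the behaviour reported for the red (most unlikely) structures.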

Figure 16: Contribution of each factor to the overall density of the average native. Top: the density of each factor (φ's on the right, ψ's on the left) is represented by the color of the corresponding ball. The four lowest factors are labeled and a graph is included at the bottom. Bottom: graph of the log-density for selected factors. Black dots represent the molecular dynamics samples used to compute the density; white circles represent the experimental structures; the gray square is the native (experimental average) structure; the white x and * represent the least and most likely structures of the molecular dynamics ensemble, respectively. Note that the native has some factors not located at the top of the density, thereby explaining why the overall probability of this single structure is low.

Figure 17: New structures. These were obtained by gradient ascent starting at the most unlikely conformations of the original ensemble (in red), the most likely conformations of the original ensemble (in blue), and original conformations of intermediate likelihood.

4 Conclusions and discussion

A method to estimate a probability density function in the space of folded protein conformations was developed. This method does not have any free parameters to fix other than the shape of the kernel used (Section ..), and relies on fundamental results from estimation and information theory and on the assumption that only a few angles strongly influence a specific torsion angle (or in general, a few variables strongly affect another specific variable). This assumption is needed in order to reduce the dimensionality of the problem. With our framework, we not only obtain the best possible pdf (modulo our optimality criteria), but also learn the order of the model (n) and explicitly find the dependences among the torsion angles (variables).

Better estimates might be obtained if constraints between the angles (e.g., the allowed regions of the Ramachandran plot) and/or energy priors (i.e., Boltzmann weighting) are included in the model. This can be done straightforwardly and improvements are expected. Other acknowledged places for improvement include the optimization procedure used to choose the bandwidth for the kernels (which may be substituted by more efficient methods); the computation of the entropies, which can be sped up using the Fast Gauss Transform or Dual Tree methods [3, 3]; and the genetic algorithm explained in Section 2.3.2 to discover the dependences. Also, more research is required for selecting the bandwidth used in the smoothed bootstrap step (Section ..3), unless this problem is altogether avoided by using other methods (e.g., the jackknife []) which do not produce resamples having repeated observations.

Figure 18: Distance matrix between experimental conformations. Angular distance between the experimental structures (from [3]) computed using (11).

Having a method to statistically characterize the folded ensemble, it could be tempting to apply the same method to other ensembles as well. Recall that in order to apply this proposed method, the ensemble must satisfy the basic assumption about the dependency between the torsion angles (or the variables used for the structure representation in general). In particular, for the ensemble of partially folded conformations this assumption is less valid, due to the diversity of long range interactions that can take place: the number of dependences $n$ to be taken into account in this case is quite large, and the dimensionality reduction obtained is therefore not as significant as in the folded case. Again, this method can be easily extended to other descriptions of proteins, such as those including the side chains, or even to completely different datasets, if those satisfy the basic assumption stated above. Furthermore, we expect better results if more complete descriptions are supplied (e.g., the side chain angles or pair-wise distances are included), since those can also be included as conditioning variables if they turn out to be the most informative.

Our results suggest that for a given accuracy, the number of sample points in the ensemble does not need to grow exponentially with the number of residues in the peptide, but only with the number of those actually affecting each other. In other words, the true dimension of the dataset is much smaller than the total number of torsion angles, being defined only by the interactions in small neighborhoods and not across the whole protein.

When comparing structures in order to validate our technique (Section 3.2), we used the angles distance (which, by the way, is intimately related to the Von Mises kernels):

$$d(x, y) = \sum_{i=1}^{d} \{1 - \cos(x_i - y_i)\} \qquad (11)$$

The choice of this metric is somewhat arbitrary, undermining the results derived using it. This choice of metric is not completely compatible with the density estimate proposed in this work. If we were to believe that Equation (11) is the right metric to use for this space, we should have chosen symmetrical kernels in $d$ dimensions. Conversely, since, as explained before, this type of kernel is a very bad choice, we think that using the metric of (11) is not optimal.
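The angles distance and the distance to an ensemble can be written compactly. The sketch below assumes that the distance has the form of a sum of $1 - \cos(x_i - y_i)$ over the torsion angles, with angles in radians, and takes the distance to an ensemble as the minimum over its members, as described in Section 3.2.2. The function names are hypothetical.

```python
import numpy as np

def angle_distance(x, y):
    """Sum over torsion angles of 1 - cos(x_i - y_i); angles in radians."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(1.0 - np.cos(x - y)))

def distance_to_ensemble(x, ensemble):
    """Minimum angle distance from x to any member of the ensemble (rows)."""
    ensemble = np.asarray(ensemble, dtype=float)
    per_member = np.sum(1.0 - np.cos(ensemble - np.asarray(x, dtype=float)), axis=1)
    return float(per_member.min())

conformation = np.array([0.1, -2.0, 3.0])
ensemble = np.array([
    [0.1, -2.0, 3.0 - 2.0 * np.pi],  # same conformation, one angle shifted by 2*pi
    [1.0, 0.0, -1.0],
])
```

The distance is periodic in every angle, so conformations that differ by full turns are treated as identical, and each angle contributes at most 2 to the sum.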
In Section.., while presenting the kernel density estimation technique, a distance was implicitly used to assess the contribution that each observation has on the point where the density is being estimated. Now that a density has been found, it may be interesting to use this relationship between distances and densities in the other direction, to derive a distance from the density. This natural distance takes into account the structure of the ensemble, and this will be further studied elsewhere.

Since it is commonplace in this field to use the Cα-RMSD to compare conformations, for completeness, we conclude by including in Figure 9 the equivalent of Figures 7b-c and 8 computed using this metric instead of (). This approach presents two main difficulties. First, it is not trivial to compute the 3D coordinates of the Cαs from the torsion angles (we used the standard Engh and Huber [33] angles for the reconstruction). Second, small variations in the torsion angles can produce large variations in the 3D coordinates. As before, we compute the RMSD distance to the whole experimental native ensemble, not just to an average structure. Note once again the automatic clustering of the most probable conformations, as computed with our pdf, and how it gets closer to the experimental ensemble. Note also the large inner variability (left figure) among the experimental conformations themselves, of the same order as the variability of the new conformations created by our algorithm when starting from the most probable ones. Of course, if this kind of comparison had been intended, a different set of features should have been chosen in the first place (not the torsion angles).
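As an illustration of comparing one conformation against a whole ensemble and not against an average structure, the following minimal sketch computes Cα-RMSD values after optimal rigid superposition. The use of the Kabsch algorithm is an assumption: the text does not say how structures were superimposed or how the per-structure distances were summarized. The function names are hypothetical, and the reconstruction of 3D coordinates from torsion angles is not covered.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Minimum RMSD between two (n_atoms, 3) coordinate sets.

    Uses the Kabsch algorithm (an assumed choice of superposition):
    center both sets, then find the proper rotation of P that best
    matches Q in the least-squares sense.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                                 # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def rmsd_to_ensemble(conformation, ensemble):
    """RMSD from one conformation to every member of an ensemble.

    Returns one value per ensemble member; how to summarize them
    (minimum, mean, full distribution) is left to the caller.
    """
    return np.array([kabsch_rmsd(conformation, ref) for ref in ensemble])
```

Keeping the full vector of distances, and not a single number, preserves the inner variability of the ensemble that the comparison is meant to expose.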

Figure 9: Graphs corresponding to Figures 7b-c and 8 but computed using Cα-RMSD.

To conclude, based on statistics and information theory, we have presented a framework to compute the distribution of protein conformations, with possible applications from protein comparisons to conformation space sampling to high resolution protein design. These applications, as well as the exploitation of the pdf to define new distance functions (to be reported elsewhere), may promote a shift from the current emphasis on single structures to the consideration of whole ensembles, allowing all the available information to play a role.

Acknowledgements: We thank Bojan Zagrovic for the careful reading of the paper, his suggestions (especially those resulting in the inclusion of Figures 7 and 9) and for pointing out relevant references. We also thank David Baker, Alexander Grossberg and Birgit Grund for helpful discussions, and ONR, NSF, NGA and DARPA for the funding. This work was carried out in part using computing resources at the University of Minnesota Supercomputing Institute.

5 References

1. Rieping W, Habeck M, Nilges M. (2005) Inferential structure determination. Science 309:
2. Rother D, Sapiro G, Pande V. (2005) Statistical characterization of protein ensembles. RECOMB 2005 Poster Abstracts:
3. Zagrovic B, Snow CD, Khaliq S, Shirts MR, Pande VS. (2002) Native-like mean structure in the unfolded ensemble of small proteins. Journal of Molecular Biology. 323:
4. Shortle D, Simons KT, Baker D. (1998) Clustering of low-energy conformations near the native structures of small proteins. Proceedings of the National Academy of Sciences, USA. 95:
5. Bradley P, Misura KMS, Baker D. (2005) Toward high-resolution de novo structure prediction for small proteins. Science 309:
6. Pande VS, Stanford University. (2005) Folding@home distributed computing. Available: http://folding.stanford.edu/ via the Internet.
7. Baker D. (2003) The Baker laboratory. Available: http:// via the Internet.
