BIOINFORMATICS: PAST, PRESENT AND FUTURE. Susan R. Wilson Mathematical Sciences Institute, Australian National University, Australia

Size: px

Start display at page:

Download "BIOINFORMATICS: PAST, PRESENT AND FUTURE. Susan R. Wilson Mathematical Sciences Institute, Australian National University, Australia"

Bryan White
5 years ago
Views:

1 BIOINFORMATICS: PAST, PRESENT AND FUTURE Susan R. Wlson Mathematcal Scences Insttute, Australan Natonal Unversty, Australa Keywords: Bonformatcs, bologcal sequence analyss, sequence algnment, hdden Markov models, evolutonary models, phylogenetc reconstructon, gene expresson analyss, mcroarray data, desgn and analyss of mcroarray experments, systems bology, federated data ntegraton, bo-grds, proteomcs, functonal genomcs, comparatve genomcs Contents 1. Introducton 2. Bologcal sequence analyss 2.1. Background 2.2. Scorng Systems 2.3. Sequence Algnment 2.4. Assessment of Sgnfcance Overvew of Basc Theory Complcatons and Developments 3. Applcatons of hdden Markov models n bonformatcs 4. Evolutonary models and phylogenetc reconstructon 5. Gene expresson analyss 5.1. Background 5.2. Issues Concernng Outcome Measures 5.3. Expermental Desgn 5.4. Analyss of Mcroarray Data. 6. Statstcal methods n proteomcs 7. Systems bology 8. Federated data ntegraton and bo-grds 9. Dscusson Glossary Bblography Bographcal Sketch Summary Bonformatcs s a newly emergng feld that s ncreasngly beng wdely vewed as a more fundamental dscplne at the ntersecton of the bologcal scences wth the mathematcal, statstcal, and physcal scences and chemstry and nformaton technology. Ths revew chapter touches brefly on those aspects of bonformatcs that wll be of nterest to bometrcans, ncludng bologcal sequence analyss and algnment (Secton 2), applcatons of hdden Markov models n bonformatcs (Secton 3), evolutonary models and phylogenetc reconstructon (Secton 4), gene expresson analyss and mcroarrays (Secton 5), proteomcs (Secton 6) systems bology (Secton 7) and federated data ntegraton and bo-grds (Secton 8), fnshng wth a bref dscusson hghlghtng the exctng future of the feld durng the comng decades.

2 1. Introducton Bonformatcs s an emergng feld that was once consdered to be the part of computatonal bology that explctly dealt wth the management of the ncreasng number of large databases, ncludng methods for data retreval and analyses, and algorthms for sequence smlarty searches, structural predctons, functonal predctons and comparsons and so forth. There has been phenomenal growth of lfe scence databases. For example, the most wdely used nucleotde sequence database s Genbank that s mantaned by the Natonal Center for Botechnology Informaton (NCBI) of the US Natonal Lbrary of Medcne; as of February 2003 t contaned 28.5 bllon nucleotdes from 22.3 mllon sequences. Its sze contnues to grow exponentally as more genomes are beng sequenced. However, there s a very large gap (that wll take a long tme to fll) between our knowledge of the functonng of the genome and the generaton (and storng) of raw genomc data. Bonformatcs nvolves the analyss of bologcal data. So very recently, the feld of bonformatcs has been rapdly evolvng, not only due to the mpact of the varous genome proects, but also wth the development of expermental technologes, such as mcroarrays for gene expresson analyses and mass spectrometry for detecton of proten-proten nteractons. Currently bonformatcs s beng ncreasngly wdely vewed as a more fundamental dscplne that also encompasses mathematcs, statstcs, physcs and chemstry. Further, the feld s already lookng forward to a systems bology approach and to smulatons of whole cells wth ncorporaton of more levels of complexty. A recent edtoral n the ournal Bonformatcs noted that a maor upcomng challenge for the bonformatcs communty [s] to adopt a more statstcal way of thnkng and to nteract more closely wth statstcans. The stated goal for many researchers s for developments n Bonformatcs to be focused at fndng the fundamental laws that govern bologcal systems, as n physcs. However, f such laws exst, they are a long way from beng determned for bologcal systems. Instead the current am s to fnd nsghtful ways to model lmted components of bologcal systems and to create tools whch bologsts can use to analyze data. Examples nclude tools for statstcal assessment of the smlarty between two or more DNA sequences or proten sequences, for fndng genes n genomc DNA, for quanttatve analyss of functonal genomcs data, and for estmatng dfferences n how genes are expressed n say dfferent tssues, for analyss and comparson of genomes from dfferent speces, for phylogenetc analyss, and for DNA sequence analyss and assembly. Tools such as these nvolve statstcal modelng of bologcal systems. Although the most relable way to determne a bologcal molecule s structure or functon s by drect expermentaton, there s much that can be acheved n vtro,.e. by obtanng the DNA sequence of the gene correspondng to an RNA or proten and analyzng t, rather than the more laborous fndng of ts structure or functon by drect expermentaton. Much bologcal data arse from mechansms that have a substantal probablstc component, the most sgnfcant beng the many random processes nherent n bologcal evoluton, and also from randomness n the samplng process used to collect the data. Another source of varablty or randomness s ntroduced by the

3 botechnologcal procedures and experments used to generate the data. So the basc goal s to dstngush the bologcal sgnal from the nose. Today, as expermental technques are beng developed for studyng genome wde patterns, such as expresson arrays, the need to approprately deal wth the nherent varablty has been multpled astronomcally. For example, we have progressed from studyng one or a few genes n comparatve solaton to beng able to evaluate smultaneously thousands of genes (or expressed sequence tags) Not only must methodologes be developed whch scale up to handle the enormous data sets generated n the post-genomc era, they need to become more senstve to the underlyng bologcal knowledge and understandng of the mechansms that generate the data. For bometrcans, research has reached an exctng and challengng stage at the nterface of computatonal statstcs and bology. The need for novel approaches to handle the new genome-wde data (ncludng that generated by mcroarrays) has concded wth a perod of dramatc change n approaches to statstcal methods and thnkng. Ths quantum change has been brought about, or even has been drven by, the potental of ever more ncreasng computng power. What was thought to be ntractable n the past s now feasble, and so new methodologes need to be developed and appled. Unfortunately too many of the current practces n the bologcal scences rely on methods developed when computatonal resources were very lmtng and are often ether (a) smple extensons of methods for workng wth one or a few outcome measures, and do not work well when there are thousands of outcome measures, or (b) ad-hoc methods (that are commonly referred to as statstcal or computatonal, or more recently data mnng! methods) that make many assumptons for whch there s often no (bologcal) ustfcaton. The challenge now s to creatvely combne the power of the computer wth relevant bologcal and stochastc process knowledge to derve novel approaches and models, usng mnmal assumptons, and whch can be appled at genomc wde scales. Such technques comprse the foundaton of bonformatc methods n the future. 2. Bologcal Sequence Analyss 2.1. Background Wth the advent of whole genomes becomng avalable for many speces, as well as many other (usually extremely large) databases, such as proten sequence databases, ncreasngly bologsts are askng questons about whether some query sequence of nterest to her or hm s sgnfcantly smlar to some other sequence/s n one (or more) of these databases. If some of these smlar sequences are lkely to correspond to genes or protens wth known functons, then by assocaton t s nferred that the functon of the query sequence s related. Essentally an evolutonary model s assumed where the two sequences have dverged from some common ancestor by the process of mutaton and selecton. Mutatons can be such as to change one nucleotde to another n a DNA sequence or change one amno acd to another n a proten. Natural selecton s a type of screenng process that favors neutral or advantageous mutatons, nsertons and deletons compared wth more deleterous mutatons, nsertons and deletons.

4 Consder the followng two sequences u1, u2,, un 1 v1, v2,, vn 2 where u and v represent the elements of the set {A,G,C,T} n the case of DNA sequences, and have the letters from the set of 20 amno acds n the case of proten sequences. Comparsons of two sequences usually cannot dstngush between whether a deleton has occurred n one sequence or an nserton n the other. Insertons and deletons are referred to as gaps, and n scorng an algnment these are penalzed by assgnng them a cost. The most wdely used program s BLAST (Basc Local Algnment Search Tool) and ts varants, and the more heurstc algorthm FASTA s also wdely used. Ths method starts wth an ntal word search - fndng the best and longest contnuous set of ungapped words n a database - as a startng pont. The length of ths - the ntal k-tuple or word sze - s the prmary determnant of the speed and senstvty of the search. In the remander of ths secton we overvew sequence analyss from the vewpont of basc BLAST search, separately dscussng the three steps, namely () fndng a scorng system for comparng elements of two sequences () fndng optmal algnments, () assessng sgnfcance Scorng Systems For comparng DNA sequences, smple nucleotde scores can be used, such as + m for a match, m' otherwse (that may be equal to m ), correspondng to a smple evolutonary model n whch all nucleotdes are equally common and all substtutons are equally lkely. However, f the sequences of nterest code for proten, t s usually better to compare the proten translaton to amno acds, snce, after only a small amount of evolutonary change, f smple nucleotde substtutons are used there s less nformaton wth whch to deduce homology. Substtuton matrces are used for amno acd algnments. These are matrces n whch each possble resdue substtuton s gven a score reflectng the probablty that t s related to the correspondng resdue n the query. The algnment score wll be the sum of the scores for each poston n the algned sequence. These scores are obtaned as follows. The null hypothess s that u and v occur ndependently, and hence the probablty of the two sequences s pu p v. The alternatve hypothess s that the two sequences have dverged from the same ancestor. Denotng the probablty that u and v have evolved ndependently from w k as p uv then the probablty for the whole algnment s p uv. The rato of these two probabltes gves the lkelhood rato comparng the two hypotheses, and to obtan an addtve scorng system, logarthms are taken, so su (, v) = log( pu ) v p u p v (1)

5 For protens we have a matrx, and the commonly used scorng matrces are PAM and BLOSUM matrces. Elements n the accepted pont mutaton PAMn matrces are computed from a Markov chan model of the replacement of one amno acd by another followng ts ntal occurrence by mutaton, takng nto account the frequences of the amno acds, ther respectve mutabltes and the probabltes for replacement of amno acds. A PAM1 substtuton matrx has the property that t s derved from a Markov chan for whch the average probablty of a change from one amno acd to another n the chan s A PAMn substtuton matrx s found from the n th power of the Markov chan transton matrx that gave the PAM1 substtuton matrx. Now the elements n the (symmetrc) BLOSUMx matrx are found by clusterng proten sequences nto blocks of algned sequences such that they have x% dentty, and then an estmated and rounded log lkelhood rato s determned. The BLAST web page at NCBI gves the matrces for x = 45, 62 & 80. The 20 dagonal elements are all postve, whle the 190 dstnct off-dagonal elements (.e. parwse comparsons) are generally negatve. Note, the larger n s for a PAM matrx the longer s the evolutonary dstance, whereas for BLOSUM matrces, smaller values of x correspond to longer evolutonary dstance. In choosng to use, say a PAMn (or BLOSUMx) substtuton matrx, an assumpton s beng made about the value of n (or x) and hence an mplct assumpton about how long n the past the most recent common ancestor exsted. Such an assumpton s most unlkely to be correct, especally snce databases contan nformaton about many speces, and the most recent ancestor of these varous speces wth the speces of the query sequence mght vary markedly. Commonly chosen values are x = 62, n = 120 or 250. Problems can arse f too small a value of n s chosen. For example, suppose that wth the correct choce n' the probablty that u and v have evolved ndependently from w k s p ' uv, then the mean score s proportonal to p ' uv log ( p' ) uv p u p v (2), As n ', the mean score becomes negatve, and the more negatve ths value, the more lkely t s that the null hypothess wll be accepted. For example, n the smple symmetrc model (descrbed later), f the value n = 100 s chosen, the mean score s negatve for n' 193. Alternatvely f too small a value of n s chosen the test s less powerful. Note that tryng to overcome ths problem by choosng a varety of substtuton matrces leads to one of the many multple testng problems nherent n the use of BLAST (and dscussed further below). When comparng non-codng DNA sequences, a more complex model than the above smple evolutonary model, n whch transtons are more lkely than transversons, yelds dfferent msmatch scores for transtons and transversons. The best scores to use wll depend on whether one s comparng relatvely dverged or closely related sequences.

6 2.3. Sequence Algnment Gven a scorng system, next we need an algorthm to fnd an optmal algnment for a par of sequences. Algnments can be global or local. To fnd the global algnment of proten sequences, Needleman & Wunch developed a dynamc programmng algorthm, and there have been many extensons. The most wdely used local algnment method s that gven by Smth & Waterman (S-W) by frst constructng a matrx M ndexed by and correspondng to u and v n our sequences. Let S be the score of the optmal algnment between the subsequence up to u, and the subsequence up to v. Then the S- W algorthm s gven by S max(0, S + s( u, v ), S = 1, 1 + gap penalty, 1,, 1 + gap penalty) S The score S can never become negatve, and hence always there wll be areas of smlartes even f there are long msmatches or gaps n between Assessment of Sgnfcance Overvew of Basc Theory Havng obtaned an optmal algnment, the next step s to assess ts sgnfcance, namely s ths a sgnfcantly good algnment,.e. match? In other words how does one test the null hypothess that there s no sgnfcant homology between the two sequences aganst the alternatve hypothess that there s sgnfcant homology? The basc theory underlyng the answer to ths queston was due to Sam Karln, and s based on random walk theory, on renewal theory and on asymptotc dstrbuton theory. Consder Table 1 that gves a smple case of two algned DNA sequences (of equal length). The bold poston numbers are those n whch the same nucleotde occurs n both sequences (.e. matches occur). Suppose we gve a score +m for a match and m where the nucleotdes are dfferent. Comparng the two sequences (startng from the left, wth value zero) we can fnd the accumulated value of the scores, namely m, 2m, m, 0, m,... (gven n bottom row of Table 1; alternatvely a graph representaton could be used). Ths s a smple random walk wth steps ± m. Ladder ponts are those that are lower than any prevously reached pont, and occur when the values 0, m, 2m, are frst reached n ths example at postons L 1 = 4, L 2 = 5, L 3 = 8,... (all shown n bold n the bottom row). (3) The term upwards excurson descrbes that part of the walk between two consecutve ladder ponts, L and L + 1, and nterest s n the maxmum heght startng from the frst ladder pont L. (Where ladder ponts fall at consecutve postons, such as 8 & 9 above, then ths value s zero.) The maxmum heght acheved by the varous excursons s the

7 bass for the test statstc to evaluate the homology of the two sequences. In the above example the heghts of the varous upward excursons are, respectvely, 0, 1, 0, 0, 3, 4, wth a maxmum value of G G T A C T G G G G G G G G C C T T C C m 2m m 0 -m 0 -m -2m -3m -4m A A C T T T T T C C A A C C G G T T A A -3m -2m -m -2m -3m -4m -3m -2m -3m -4m C G G G T A A A A T G G G G T A T C C C -5m -4m -3m -2m -m -2m -3m -4m -5m -6m Table 1: Two algned DNA sequences For comparson of two proten sequences, the scores are obtaned from the approprate (or selected) substtuton matrx, and the random walk s less regular than n the above example. Suppose the substtuton matrx used allocates a score S (, ) to a match of amno acds and. The null hypothess s that the two sequences are random wth respect to one another and under ths hypothess the mean score s, S (, pp ) where p s the frequency of amno acd. For the BLAST procedure t s necessary that ths mean score be negatve, and we assume t s, so that the general trend of the random walk (startng from the left) s downward, passng through a sequence of ncreasngly negatve ladder ponts. The test statstc s the heght Y max of the largest upwards excurson followng a ladder pont relatve the poston of that ladder pont, before the walk reaches the next ladder pont. It can be shown, usng standard random walk theory, that f Y s the maxmum heght acheved by the walk after reachng any ladder pont and before reachng the next, then Pr( Y y) ~ Ce λ y (4) where λ s the unque postve soluton of the moment generatng equaton S(, ) ppe λ = 1 (5), Ths s referred to as a geometrc-lke dstrbuton. For the smple random walk above, λ = log (1 p) p and lettng p (<1/2) be the probablty of a postve step, we obtan [ ] C = 1 e λ. The formula for C for a gven substtuton matrx s more complcated (and

8 not gven here). We are nterested n Y max, the largest of these maxmum heghts. Unfortunately there s no lmtng probablty dstrbuton for Y max, although bounds can be found. Next consder the number of ladder ponts reached by the walk before t fnshes (after n steps have been taken where n s the length of the sequences beng compared). The theory s complex, but for our purposes the number of ladder ponts can be taken as na where A s the mean number of steps taken from one ladder pont to the next (and can be found by random walk theory or by applcaton of Wald's dentty), and also s taken here as gven. Then t can be shown that the P-value assocated wth an observed value y max of Y max s approxmately max ( Ce λ y ) na 1 1 (6) and n BLAST ths s approxmately 1 exp( e s ), where s = λ ymax log nk and λ K = Ce A. When s s large, P-value ~ e s. Note that f the elements n the chosen substtuton matrx are all multpled by some constant k, then t can be shown that the P-value s not altered. Hence the quantty s s called a normalzed score. The BLAST software also gves an Expect value. Ths s the mean number of walks to reach a heght equal to or hgher than the maxmum heght observed when the null hypothess s true. When the P-value s small, ths s close to the P-value, otherwse t s close to log(1 P-value). Note that the P-values are qute senstve to the somewhat arbtrary numercal values of K and λ Bblography TO ACCESS ALL THE 24 PAGES OF THIS CHAPTER, Vst: Bald, P. and Brunak, S. (2001). Bonformatcs: The Machne Learnng Approach, 2 nd Ed., 452 pp, USA: The MIT Press. [Ths gves more background to the materal of Sectons 3 and 4, and covers areas of bonformatcs that have appled the machne learnng approach.] Durbn, R., Eddy, S., Krogh, A. and Mtchson, G. (2000). Bologcal Sequence Analyss: Probablstc Models of Protens and Nuclec Acds, 356 pp, UK: Cambrdge Unversty Press. [Ths concentrates on applcatons of hdden Markov models n bonformatcs.] Elston, R., Olson, J., and Palmer, L. (2002). (eds) Bostatstcal Genetcs and Genetc Epdemology, 831 pp, UK: John Wley. [Ths ncludes artcles on bonformatcs, on DNA sequences, and on sequence analyss; updates wll appear n the Encyclopeda of Bostatstcs 2 nd Edton, 2004.]

9 Ewens W.J. and Grant, G.R. (2001). Statstcal Methods n Bonformatcs. 476 pp, USA: Sprnger- Verlag. [Ths ncludes more background on the materal n Sectons 2, 3 and 4, especally the theory underpnnng BLAST.] Mandonald, J.H., Pttelkow, Y.E. and Wlson, S.R. (2003). Some consderatons for the desgn of mcroarray experments. Scence and Statstcs, Vol. 40 (ed. D.R. Goldsten), USA: Insttute of Mathematcal Statstcs. [Ths dscusses ssues relevant for the desgn of mcroarray experments wth emphass on uses of replcaton and the mportance of dentfyng maor sources of varaton.] Speed, T. (2003). (ed.) Statstcal Analyss of Gene Expresson Mcroarray Data, 222 pp, UK: Chapman & Hall. [Ths s a small collecton of papers manly concerned wth analyss of gene expresson data.] Bographcal Sketch Sue Wlson was born s Sydney, Australa; she obtaned her B.Sc. from the Unversty of Sydney (1969), followed by her Ph.D. from the ANU (awarded 1973). Sue then spent two years as a Lecturer n the Department of Probablty and Statstcs at Sheffeld Unversty. She returned to ANU towards the end of 1974 and has snce held a varety of postons there, both n some of the Statstcal groupngs, as well as at the Natonal Centre for Epdemology and Populaton Health. She s currently Professor, Statstcal Scence Program, Centre for Mathematcs and ts Applcatons, Mathematcal Scences, Insttute and Co- Drector, Centre for Bonformaton Scence (ont wth John Curtn School of Medcal Research), at the Australan Natonal Unversty (ANU). Sue has over 150 publcatons n bometry and appled statstcs, wth a partcular emphass recently on statstcal genetcs/genomcs and bonformatcs. These papers have arsen from her extensve consultng experence n the bologcal, socal, and medcal scences, leadng to statstcal modelng developments to answer substantve research questons n these dscplnes. Sue s an elected member of the Internatonal Statstcal Insttute, a Fellow of the Amercan Statstcal Assocaton and a Fellow of the Insttute of Mathematcal Statstcs (IMS). She was Presdent, Internatonal Bometrc Socety, 1999 & 2000 (Vce Presdent, IBS, 1998, 2001). Currently she s Assocate Edtor, Annals of Human Genetcs; Assocate Edtor, Computatonal Statstcs and Data Analyss; Member, Edtoral Board, Statstcal Methods n Medcal Research; Member, Edtoral Board, Behavor Genetcs.

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh Computatonal Bology Lecture 8: Substtuton matrces Saad Mnemneh As we have ntroduced last tme, smple scorng schemes lke + or a match, - or a msmatch and -2 or a gap are not justable bologcally, especally