2 Pairwise Alignment

We represent sequences by strings of alphabetic letters. If we recognize a significant similarity between a new sequence and a sequence about which something is already known, we can transfer information about structure and function to the new sequence. We will study pairwise alignment in this section. The key issues include: the scoring system used to rank alignments; the algorithm used to find optimal scoring alignments; and the statistical methods used to evaluate the significance of an alignment score.

2.1 The scoring model

When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. The basic mutational processes are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues. Insertions and deletions are together referred to as gaps. We define the score of an alignment to be the logarithm of the relative likelihood that the sequence pair are related, compared to being unrelated. The score will be a sum of terms for each aligned pair of residues, plus terms for each gap. Using an additive scoring scheme corresponds to an assumption that we consider residues at different sites in a sequence to have occurred independently. The independence assumption is a simple and reasonable approximation for sequences.

Let us establish some notation. We will be considering a pair of sequences, $x$ and $y$, of lengths $n$ and $m$, respectively. Let $x_i$ be the $i$th symbol in $x$ and $y_j$ be the $j$th symbol of $y$. These symbols will come from some alphabet $\mathcal{A}$; in the case of DNA this will be the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids. We denote symbols from this alphabet by lower-case letters like $a$, $b$.

Let us first consider ungapped global pairwise alignments, that is, two completely aligned equal-length sequences. The unrelated or random model $R$ gives the probability

$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j} \tag{2.1}$$

with the assumption that letter $a$ occurs independently with frequency $q_a$. In the alternative match model $M$, aligned pairs of residues occur with a joint probability $p_{ab}$. This value $p_{ab}$ can be thought of as the probability that the residues $a$ and $b$ have been derived from a common ancestor. The probability for the whole alignment is

$$P(x, y \mid M) = \prod_i p_{x_i y_i}.$$

The ratio of these two likelihoods is known as the odds ratio:

$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}.$$

We take the logarithm of this ratio (the log-odds ratio) to obtain

$$S = \sum_i s(x_i, y_i), \tag{2.2}$$

where

$$s(a, b) = \log \frac{p_{ab}}{q_a q_b} \tag{2.3}$$

is the log likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair. Equation (2.2) is a sum of individual scores $s(a, b)$ for each aligned pair of residues. The $s(a, b)$ scores can be arranged in a matrix. For proteins, they form a $20 \times 20$ matrix. This is known as a score matrix or a substitution matrix. An example of a substitution matrix is the BLOSUM50 matrix shown in Figure 2.2. It is derived as above, from the matching probabilities of pairs of residues. In fact, any substitution matrix is making a statement about the probability of observing $ab$ pairs in real alignments.

We expect to penalize gaps. The standard cost associated with a gap of length $g$ is given either by a linear score

$$\gamma(g) = -g d \tag{2.4}$$

or an affine score

$$\gamma(g) = -d - (g - 1) e \tag{2.5}$$

where $d$ is called the gap-open penalty and $e$ is called the gap-extension penalty. Usually, we set $e < d$, allowing long insertions and deletions to be penalized less. Gap

penalties also correspond to a probabilistic model of alignment. We assume the probability of a gap occurring at a particular site in a given sequence is

$$P(\text{gap}) = f(g) \prod_{i \,\in\, \text{gap}} q_{x_i}. \tag{2.6}$$

When we take the log ratio of this probability over the random model, the $q_{x_i}$ terms cancel out. Thus the gap penalties correspond to the log probability of a gap of that length, $\gamma(g) = \log(f(g))$.

2.2 Alignment algorithms

Given a scoring system, we need an algorithm for finding an optimal alignment for a pair of sequences. When we use an additive alignment score, the algorithm is called dynamic programming. Dynamic programming algorithms are central to computational sequence analysis. They are guaranteed to find the optimal scoring alignment or set of alignments. We will use two short amino acid sequences to illustrate the alignment methods, HEAGAWGHEE and PAWHEAE. We use the BLOSUM50 score matrix, and a gap cost per unaligned residue of $d = 8$.

Global alignment: Needleman-Wunsch algorithm

The idea is to build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences. We construct a matrix $\{F(i, j)\}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, where $F(i, j)$ is the score of the best alignment between the initial segment $x_{1 \ldots i}$ of $x$ up to $x_i$ and the initial segment $y_{1 \ldots j}$ of $y$ up to $y_j$. We can build $F(i, j)$ recursively. We begin by initializing $F(0, 0) = 0$. We then proceed to fill the matrix from top left to bottom right. There are three possible ways that the best score $F(i, j)$ of an alignment up to $x_i$, $y_j$ could be obtained: $x_i$ could be aligned to $y_j$, in which case $F(i, j) = F(i-1, j-1) + s(x_i, y_j)$; or $x_i$ is aligned to a gap, in which case $F(i, j) = F(i-1, j) - d$; or $y_j$ is aligned to a gap, in which case $F(i, j) = F(i, j-1) - d$. These three cases are shown in the example below:

    I G A x_i        A I G A x_i        G A x_i - -
    L G V y_j        G V y_j - -        S L G V y_j

The best score up to $(i, j)$ will be the largest of these three options. Therefore, we have

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases} \tag{2.8}$$

This equation is applied repeatedly to fill in the matrix of $F(i, j)$ values. The following figure displays this explicitly.

[Figure: the cell $F(i, j)$ is reached from $F(i-1, j-1)$ by adding $s(x_i, y_j)$, and from $F(i-1, j)$ or $F(i, j-1)$ by adding $-d$.]

As we fill in the $F(i, j)$ values, we also keep a pointer in each cell back to the cell from which its $F(i, j)$ was derived, as shown in the example of the full dynamic programming matrix in Figure 2.5.

We have to deal with some boundary conditions. Along the top row, where $j = 0$, the values $F(i, 0)$ represent alignments of a prefix of $x$ to all gaps in $y$, so we can define $F(i, 0) = -i d$. Likewise, $F(0, j) = -j d$.

The value in the final cell of the matrix, $F(n, m)$, is by definition the best score for an alignment of $x_{1 \ldots n}$ to $y_{1 \ldots m}$, which is the score of the best global alignment of $x$ to $y$. To find the alignment itself, we must find the path of choices that led to this final value. The procedure for doing this is known as a traceback. It works by building the alignment in reverse, starting from the final cell, and following the pointers that we stored when building the matrix. At each step in the traceback process we move back from the current cell $(i, j)$ to the one of the cells $(i-1, j-1)$, $(i-1, j)$ or $(i, j-1)$ from which the value $F(i, j)$ was derived. At the same time, we add a pair of symbols onto the front of the current alignment: $x_i$ and $y_j$ if the step was to $(i-1, j-1)$, $x_i$ and the gap character '-' if the step was to $(i-1, j)$, or '-' and $y_j$ if the step was to $(i, j-1)$. At the end we

will reach the start of the matrix, $i = j = 0$. An example of this procedure is shown in Figure 2.5.

    HEAGAWGHE-E
    --P-AW-HEAE

Note that in fact the traceback procedure finds just one alignment with the optimal score; if at any point two of the derivations are equal, an arbitrary choice is made between equal options. The reason that the algorithm works is that the score is made of a sum of independent pieces, so the best score up to some point in the alignment is the best score up to the point one step before, plus the incremental score of the new step.

This alignment algorithm is of order $nm$ (or $n^2$). With biological sequences and standard computers, order $n^2$ algorithms are feasible but a little slow, while order $n^3$ algorithms are only feasible for very short sequences.

Local alignment: Smith-Waterman algorithm

In global alignment, we are looking for the best match between two sequences from one end to the other. A much more common situation is where we are looking for the best alignment between subsequences of $x$ and $y$. This arises for example when two protein sequences may share a common domain, or when comparing two very highly diverged sequences. The highest scoring alignment of subsequences of $x$ and $y$ is called the best local alignment.

The algorithm for finding optimal local alignments is closely related to that for global alignment. The alignment now can start anywhere in the alignment matrix. Therefore, we should not consider the cells with negative values of $F(i, j)$ for the best alignment. The recursive equation becomes

$$F(i, j) = \max \begin{cases} 0, \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases} \tag{2.9}$$

Taking the option 0 corresponds to starting a new alignment. If the best alignment up to some point has a negative score, it is better to start a new one, rather than extend the old one.
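As a concrete reference point for both recurrences, the global algorithm (recurrence (2.8), the boundary conditions $F(i,0) = -id$ and $F(0,j) = -jd$, and the traceback) might be sketched in Python as below. This is a minimal sketch, not a definitive implementation: the function name `needleman_wunsch`, the pointer labels and the toy scoring function `simple_score` are assumptions made for illustration, and `simple_score` stands in for a real substitution matrix such as BLOSUM50. The local recurrence (2.9) differs from this only in the extra option 0 and in the boundary values.

```python
def simple_score(a, b):
    # Hypothetical toy scoring: +1 for a match, -1 for a mismatch.
    # A real application would look up s(a, b) in a substitution matrix.
    return 1 if a == b else -1


def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap cost d (recurrence 2.8)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Boundary conditions: a prefix aligned entirely to gaps.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"

    # Fill from top left to bottom right, keeping a pointer per cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),      # x_i aligned to a gap
                (F[i][j - 1] - d, "left"),    # y_j aligned to a gap
            ]
            # Ties are broken arbitrarily (here: first option wins).
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])

    # Traceback from (n, m) to (0, 0), building the alignment in reverse.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = ptr[i][j]
        if step == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif step == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AC", "A", simple_score, 2))  # (-1, 'AC', 'A-')
```

Storing one pointer per cell mirrors the text's description and returns just one of the possibly several optimal alignments.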

Moreover, a local alignment can end anywhere in the matrix, so instead of taking the value in the bottom right corner, $F(n, m)$, for the best score, we look for the highest value of $F(i, j)$ over the whole matrix, and start the traceback from there. The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment. An example is given in Figure 2.6.

Repeated matches

The previous algorithms gave the best single local match between two sequences. If one or both of the sequences are long, it is quite possible that there are many different local alignments with a significant score. An example would be where there are many copies of a repeated domain or motif in a protein. We briefly introduce here a method for finding repeated matches. This method is asymmetric: it finds one or more non-overlapping copies of sections of one sequence (e.g. the domain or motif) in the other.

Let us assume that we are only interested in matches scoring higher than some threshold $T$. An example of the repeat algorithm is given in Figure 2.7. We start by initializing $F(0, 0) = 0$. But $F(i, 0)$ is now the best sum of scores up to the subsequence $x_{1 \ldots i}$, from which a repeat can begin to match sequence $y$. The recursive equations are below:

$$F(i, 0) = \max \begin{cases} F(i-1, 0), \\ F(i-1, j) - T, \quad j = 1, \ldots, m; \end{cases} \tag{2.11}$$

$$F(i, j) = \max \begin{cases} F(i, 0), \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases} \tag{2.12}$$

Equation (2.11) handles unmatched regions and ends of matches, only allowing matches to end when they have score at least $T$. Equation (2.12) handles starts of matches and extensions. The total score of all the matches is obtained by adding an extra cell to the matrix, $F(n+1, 0)$, using (2.11).
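The fill step for repeated matches, equations (2.11) and (2.12) plus the extra cell $F(n+1, 0)$, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name `repeated_match_score` and the toy scoring function are hypothetical, the row $i = 0$ is taken to be all zeros (the text only fixes $F(0,0) = 0$), and $y$ is assumed to be non-empty. Only the total score is computed; the traceback is omitted.

```python
def toy_score(a, b):
    # Hypothetical scoring: +2 for a match, -2 for a mismatch.
    return 2 if a == b else -2


def repeated_match_score(x, y, score, d, T):
    """Total score of repeated matches of y in x (equations 2.11, 2.12).

    Assumes y is non-empty and that F(0, j) = 0 for all j.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # (2.11): stay unmatched, or end a match whose score exceeds T.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        for j in range(1, m + 1):
            # (2.12): start a new match from F(i, 0), or extend one.
            F[i][j] = max(
                F[i][0],
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Extra cell F(n+1, 0), obtained by applying (2.11) once more.
    total = max(F[n][0], max(F[n][1:]) - T)
    return total, F


total, _ = repeated_match_score("ABCAB", "AB", toy_score, d=3, T=3)
print(total)  # 2: two copies of "AB", each scoring 4 - T = 1
```

Because each completed match contributes its alignment score minus $T$, raising $T$ both filters out weak matches and lowers the total.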

2.3 Dynamic programming with more complex models

So far we have only considered the simplest gap model, in which the gap score $\gamma(g)$ is a simple multiple of the length. This type of scoring scheme is not ideal for biological sequences: it penalizes additional gap steps as much as the first, whereas, when gaps do occur, they are often longer than one residue. If we are given a general function for $\gamma(g)$ then we can still use dynamic programming, with adjustments to the recurrence relations as typified by the following:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(k, j) + \gamma(i - k), \quad k = 0, \ldots, i-1, \\ F(i, k) + \gamma(j - k), \quad k = 0, \ldots, j-1. \end{cases} \tag{2.15}$$

However, this procedure now requires order $n^3$ operations to align two sequences of length $n$, rather than order $n^2$ for the linear gap cost. This makes the algorithm impractical in most cases.

Alignment with affine gap scores

For the affine gap cost structure $\gamma(g) = -d - (g - 1) e$, there is an order $n^2$ implementation of dynamic programming. However, we now have to keep track of multiple values for each pair of residue coefficients $(i, j)$ in place of the single value $F(i, j)$, to denote three separate situations:

    I G A x_i        A I G A x_i        G A x_i - -
    L G V y_j        G V y_j - -        S L G V y_j

Let $M(i, j)$ be the best score up to $(i, j)$ given that $x_i$ is aligned to $y_j$, $I_x(i, j)$ be the best score given that $x_i$ is aligned to a gap (in an insertion with respect to $y$), and finally $I_y(i, j)$ be the best score given that $y_j$ is in an insertion with respect to $x$. The recurrence relations corresponding to (2.15) now become

$$M(i, j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j), \\ I_x(i-1, j-1) + s(x_i, y_j), \\ I_y(i-1, j-1) + s(x_i, y_j); \end{cases} \tag{2.16}$$

$$I_x(i, j) = \max \begin{cases} M(i-1, j) - d, \\ I_x(i-1, j) - e; \end{cases}$$

$$I_y(i, j) = \max \begin{cases} M(i, j-1) - d, \\ I_y(i, j-1) - e. \end{cases}$$

In these equations, we assume that a deletion will not be followed directly by an insertion. As previously, we can find the alignment itself using a traceback procedure. The system defined by equation (2.16) can be described very elegantly by the diagram in Figure 2.9. This shows a state for each of the three matrix values, with transition arrows between states. An example of a short alignment and the corresponding state path through the affine gap model is shown in the accompanying figure.

2.4 Heuristic alignment algorithms

So far all the alignment algorithms we have considered are guaranteed to find the optimal score according to the specified scoring scheme. In particular, the affine gap versions described in the last section are generally regarded as providing the most sensitive sequence matching methods available. However, they are not the fastest available sequence alignment methods, and in many cases speed is an issue. A number of heuristic techniques are available, for example BLAST and FASTA. They are faster, practical algorithms used to search public databases.

The BLAST package provides programs for finding high scoring local alignments between a query sequence and a target database. BLAST makes a list of all neighborhood words of a fixed length (by default 3 for protein sequences, and 11 for nucleic acids) that would match the query sequence somewhere with score higher than some threshold. It then scans through the database, and whenever it finds a word in this set, it starts a hit extension process to extend the possible match as an ungapped alignment in both directions, stopping at the maximum scoring extension.

2.5 Significance of scores

Now that we know how to find an optimal alignment, how can we assess the significance of its score? That is, how do we decide if it is a biologically meaningful alignment giving evidence for a homology, or just the best alignment between two entirely unrelated sequences? There are two approaches. One is Bayesian, in which we calculate the posterior probability of a match given the alignment of $x$ and $y$. We previously gave an

alignment score $S$ based on the log-odds ratio of the likelihoods of the match model and the random model:

$$S = \log \frac{P(x, y \mid M)}{P(x, y \mid R)}.$$

Using Bayes' rule, we can calculate the probability $P(M \mid x, y)$ given the additional information of the priors $P(M)$ and $P(R)$. The log-odds score of the posterior is actually

$$S' = S + \log \frac{P(M)}{P(R)}.$$

An alternative way to consider significance uses a classical statistical framework. We can look at the distribution of the maximum of $N$ match scores to independent random sequences. If the probability of this maximum being greater than the observed test score is small, then the observation is considered significant.

For local ungapped alignments, there is another approximation. The number of unrelated matches with score greater than $S$ is approximately Poisson distributed, with mean

$$E = K m n \, e^{-\lambda S},$$

where $\lambda$ and $K$ are parameters. The probability that there is a match of score greater than $S$ is then

$$P(x > S) = 1 - e^{-E}.$$

The $E$ measurement is used in the report of BLAST alignments. Instead of the raw score $S$, BLAST uses a bit score $S_b$, which is a normalization of $S$:

$$S_b = \frac{\lambda S - \ln K}{\ln 2}.$$

The $E$-value then becomes

$$E = m n \, 2^{-S_b}.$$

2.6 Deriving score parameters from alignment data

In the section on the scoring model, we described how to derive scores for pairwise alignment algorithms from probabilities. However, this left open the issue of how to estimate the probabilities. A simple and obvious approach would be to count the frequencies of

aligned residue pairs and of gaps in confirmed alignments, and to set the probabilities $p_{ab}$, $q_a$ and $f(g)$ to the normalized frequencies.

The widely used BLOSUM matrix set was derived from a set of aligned, ungapped regions from protein families called the BLOCKS database. The sequences from each block were clustered, putting two sequences into the same cluster whenever their percentage of identical residues exceeded some level $L\%$. Then the frequencies $A_{ab}$ of observing residue $a$ in one cluster aligned against residue $b$ in another cluster are calculated, correcting for the sizes of the clusters by weighting each occurrence by $1/(n_1 n_2)$, where $n_1$ and $n_2$ are the respective cluster sizes. From $A_{ab}$, the probabilities are estimated by

$$q_a = \frac{\sum_b A_{ab}}{\sum_{cd} A_{cd}}, \qquad p_{ab} = \frac{A_{ab}}{\sum_{cd} A_{cd}}.$$

Then

$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}.$$

For $L = 62$ and $L = 50$ we get the BLOSUM62 and BLOSUM50 substitution matrices respectively. BLOSUM62 is standard for ungapped matching, and BLOSUM50 for alignment with gaps.
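To make the last step concrete, the estimates of $q_a$ and $p_{ab}$ and the resulting log-odds scores $s(a, b)$ can be computed from a table of pair frequencies $A_{ab}$ along the following lines. This is a sketch under assumptions: the function name `log_odds_scores` and the two-letter toy counts are invented for illustration, the table is assumed to hold both $(a, b)$ and $(b, a)$ entries, and natural logarithms are used with no scaling or rounding, so the values are not on the scale of the published BLOSUM matrices.

```python
import math


def log_odds_scores(A):
    """Log-odds scores s(a, b) from pair frequencies A[(a, b)].

    A is assumed to be a full symmetric table, i.e. it contains
    both (a, b) and (b, a) entries.
    """
    total = sum(A.values())                      # sum over cd of A_cd
    p = {pair: count / total for pair, count in A.items()}
    q = {}
    for (a, _b), count in A.items():             # q_a = sum_b A_ab / total
        q[a] = q.get(a, 0.0) + count / total
    # Pairs never observed (p_ab = 0) are skipped: their log is undefined.
    return {
        (a, b): math.log(p_ab / (q[a] * q[b]))
        for (a, b), p_ab in p.items()
        if p_ab > 0
    }


# Hypothetical toy counts over a two-letter alphabet.
A = {("A", "A"): 4, ("A", "B"): 1, ("B", "A"): 1, ("B", "B"): 4}
s = log_odds_scores(A)
print(s[("A", "A")] > 0, s[("A", "B")] < 0)  # True True
```

Pairs seen more often than expected by chance ($p_{ab} > q_a q_b$) get positive scores, and pairs seen less often get negative scores, which is exactly the interpretation of equation (2.3).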
