Computational Biology Lecture 8: Substitution matrices Saad Mneimneh

As introduced last time, simple scoring schemes such as +1 for a match, -1 for a mismatch, and -2 for a gap are not biologically justifiable, especially for amino acid sequences (proteins). Instead, more elaborate scoring functions are used. These scores are usually obtained by analyzing chemical properties and statistical data for amino acids and DNA sequences. For example, it is known that amino acids of the same size are more likely to be substituted for one another. Similarly, amino acids with the same affinity to water are likely to serve the same purpose in some cases. On the other hand, some mutations are not acceptable (they may lead to the demise of the organism). PAM and BLOSUM matrices are among the results of such analysis. We will see the techniques through which PAM and BLOSUM matrices are obtained.

Substitution matrices

Chemical properties of amino acids govern how the amino acids substitute one another. In principle, a substitution matrix $s$, where $s_{ij}$ is used to score aligning character $i$ with character $j$, should reflect the probability of two characters substituting one another. The question is how to build such a probability matrix so that it closely maps reality. Different strategies result in different matrices, but the central idea is the same. If we go back to the concept of a high scoring segment pair, theory tells us that the (ungapped) alignment given by such a segment is governed by a limiting distribution such that

$$q_{ij} = p_i p_j e^{\lambda s_{ij}}$$

where $s$ is the substitution matrix used, $q_{ij}$ is the probability of observing character $i$ aligned with character $j$, and $p_i$ is the probability of occurrence of character $i$. Therefore,

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$$

This formula for $s_{ij}$ suggests a way to construct the matrix $s$. If high scoring alignments are to be real, $\{q_{ij}\}$ represent the desired probabilities of substitutions, while $\{p_i\}$ represent the background probabilities of occurrence. By observing related families of sequences, one could estimate $q$ and $p$ and hence obtain the matrix $s$ by using some scaling factor $\lambda$.
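As a minimal sketch of this construction, the following Python snippet computes $s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$ from given probabilities. The function name `log_odds_matrix`, the two-letter alphabet, and the numeric values of `p` and `q` are all made up for illustration; they are not estimated from real sequence families.

```python
import math

def log_odds_matrix(q, p, lam=1.0):
    """Return s[i][j] = (1/lam) * ln(q[i][j] / (p[i] * p[j]))."""
    s = {}
    for i in p:
        s[i] = {}
        for j in p:
            s[i][j] = math.log(q[i][j] / (p[i] * p[j])) / lam
    return s

# Hypothetical two-letter alphabet; the numbers are illustrative only.
# Each row of q sums to the corresponding background probability p[i].
p = {"A": 0.5, "B": 0.5}
q = {"A": {"A": 0.4, "B": 0.1},
     "B": {"A": 0.1, "B": 0.4}}

# Choosing lam = ln 2 expresses the scores in bits (an assumption of this sketch).
s = log_odds_matrix(q, p, lam=math.log(2))

for i in p:
    print(i, [round(s[i][j], 3) for j in p])
```

With these made-up numbers, pairs that are aligned more often than chance ($q_{ij} > p_i p_j$) receive a positive score and the others a negative one.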
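Note that

$$\lambda \sum_{i,j} p_i p_j s_{ij} = \sum_{i,j} p_i p_j \ln \frac{q_{ij}}{p_i p_j}$$

Information theory tells us that the above sum is strictly less than 0 whenever $q_{ij} \neq p_i p_j$ for some pair, so the expected score of a random pair is negative, which is a desired property. Also, note that the score $S'$ in bits of a given segment with raw score $S$ (see previous lecture) is

$$S' = \log_2 \frac{e^{\lambda S}}{K}$$

Since $S$ is a sum of terms of the form $\frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$, $e^{\lambda S}$ is a product of terms of the form $\frac{q_{ij}}{p_i p_j}$. Therefore, $S'$ reflects the log likelihood of observing the alignment due to substitution (governed probabilistically by $q$) relative to observing it simply by chance (governed probabilistically by $p$). Division by the constant $K$ adjusts the score to account for the rate of observing maximal scoring segments, as described in the previous lecture.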
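For the expected-score inequality noted above, one short argument (a sketch using only the elementary bound $\ln x \le x - 1$, and assuming that both $\{q_{ij}\}$ and $\{p_i p_j\}$ sum to 1 over all pairs) runs as follows:

```latex
\sum_{i,j} p_i p_j \ln \frac{q_{ij}}{p_i p_j}
  \;\le\; \sum_{i,j} p_i p_j \left( \frac{q_{ij}}{p_i p_j} - 1 \right)
  \;=\; \sum_{i,j} q_{ij} - \sum_{i,j} p_i p_j
  \;=\; 1 - 1 \;=\; 0
```

Equality in $\ln x \le x - 1$ holds only at $x = 1$, so the sum is strictly negative unless $q_{ij} = p_i p_j$ for every pair with $p_i p_j > 0$.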

Here is another intuitive approach that justifies the above scheme. To construct a substitution matrix to score protein alignments, a family of proteins can be considered, and a multiple alignment of all the protein sequences in the family is obtained. Again, we are considering alignments with no gaps; therefore, we assume sequences have the same length (a valid assumption if they are related) and, therefore, the multiple alignment is trivial. For any pair of amino acids $i$ and $j$ we need $q_{ij}$, the probability of observing $i$ aligned with $j$ (which is the same as $q_{ji}$), and $p_i$, the probability of observing an $i$.

The question is: in an (ungapped) alignment of sequences $x$ and $y$ that aligns two of their amino acids $i$ and $j$, did this happen by chance, or was it indeed because of a mutation from $i$ to $j$ or vice versa? To capture these complementary behaviors, we consider two models: $M$, where $x$ and $y$ are related and obtained according to the joint probabilities $q_{ij}$, and $R$, where $x$ and $y$ are unrelated and obtained independently at random according to the individual probabilities $p_i$ and $p_j$. The score is then the likelihood that the sequences are related relative to the likelihood that they are unrelated. This is called the odds ratio and is mathematically expressed as:

$$\mathrm{score}(x, y) = \frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}}$$

This formula says that the score of the (ungapped) alignment is the probability that the symbols of $x$ and $y$ are aligned because they are related, relative to the probability of their symbols being aligned just by chance.

For two aligned amino acids $i$ and $j$, take $i$'s point of view: what is the probability of seeing a $j$ on the other side? This is the probability that an $i$ mutated into a $j$, $p(i \to j)$. However, there is also a chance $p_j$ of a random occurrence of a $j$. Hence, the probability ratio $\frac{p(i \to j)}{p_j}$ reflects how much $i$ believes that this $j$ is related to it. Now, since $q_{ij} = p_i \, p(i \to j)$ (the probability of observing an $i$ and its mutated form), we can also express the likelihood as:

$$\frac{q_{ij}}{p_i p_j} = \frac{p(i \to j)}{p_j}$$
By doing this for every pair of aligned symbols in the alignment and taking the product of the terms, we obtain the formula above, which reflects how much we believe that the two entire sequences are related. In all the alignment algorithms we have seen so far, we relied on the fact that the score is additive, and this was a key property for the dynamic programming to work. In this case, when the score is computed as $\prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$, it is multiplicative. In order to make it additive we can take the log and compute:

$$\log \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

This is called the log-odds ratio. The values making up the sum are the individual scores found in the matrix; hence $s_{ij} = \log \frac{q_{ij}}{p_i p_j}$ up to some scaling factor. Note that this is symmetric, so scoring $i$ aligned with $j$ is the same as scoring $j$ aligned with $i$; hence, the direction of the alignment is not important (but one could in principle make a distinction if needed).

Now the important question: how do we compute $p_i$, $p_j$, and $q_{ij}$? We are going to look at two ways of computing them: PAM and BLOSUM matrices.

PAM (Point Accepted Mutations) matrices

PAM stands for Point Accepted Mutations. An accepted mutation is defined as a mutation that was positively selected by the environment and did not cause the demise of the organism. A PAM matrix $M$ holds the probability of $i$ being replaced by $j$ in a certain evolutionary time period. The longer the evolutionary period of time, the more difficult it is to determine the correct values. The reason is that $i$ could mutate several times before becoming a $j$, and it is hard to capture all these intermediate mutations, since we only observe $i$ and $j$. What we are going to do is look at mutations that occurred in a relatively short evolutionary period of time. One unit of evolution is defined to be the amount of evolution that changes, on average, 1 in 100 amino acids. Using this unit, a 1-PAM matrix is first computed. Using this as a starting point, a k-PAM matrix can be generated from the 1-PAM matrix.
For a 1-PAM matrix $M$, $M_{ij}$ is going to be $p(i \to j)$ scaled by a factor such that the expected number of mutations is 0.01; in other words, a mutation occurs with probability 1 in 100. The computational steps that lead to the 1-PAM matrix are:

1. Compute $p_i$ for every $i$.
2. Compute $p(i \to j)$ for every pair $i$ and $j$ and let $M_{ij} = p(i \to j)$.
3. Scale $M$ such that the expected number of mutations $\sum_i p_i (1 - M_{ii})$ is 0.01.
4. Use $s_{ij} = 10 \log_{10} \frac{M_{ij}}{p_j}$ to obtain the additive scores. $s_{ij}$ is rounded to an integer; the scaling factor 10 is used just to provide a better integer approximation.

Next we take a closer look at each of these steps. Let the frequency count $f_{ij}$ be the number of times $i$ is aligned with $j$, counting both directions. Then let the number of occurrences of $i$ be $f_i = \sum_j f_{ij}$, and the count of all characters be $f = \sum_i f_i$. Now we can estimate $p_{ij} = \frac{f_{ij}}{f}$, which is simply the rate at which $i$ was found to be aligned with $j$. Similarly, the rate of finding an occurrence of $i$ is $p_i = \frac{f_i}{f}$. With both $p_{ij}$ and $p_i$ determined, the elements of the matrix $M$ are computed as:

$$M_{ij} = p(i \to j) = \frac{p_{ij}}{p_i}$$

$M$ is indeed a probability matrix, which can be proved by noting that $\sum_j M_{ij} = 1$.

To illustrate this step of computing a matrix $M$, here is a quick example. Let the alignment be:

    A B
    A A

In this case, the frequencies are:

$$f_{AB} = 1, \quad f_{BA} = 1, \quad f_{AA} = 2$$
$$f_A = \sum_X f_{AX} = f_{AB} + f_{AA} = 3, \quad f_B = \sum_X f_{BX} = f_{BA} = 1, \quad f = \sum_X f_X = 4$$

hence the estimated probabilities:

$$p_{AB} = \frac{f_{AB}}{f} = \frac{1}{4}, \quad p_{BA} = \frac{f_{BA}}{f} = \frac{1}{4}, \quad p_{AA} = \frac{f_{AA}}{f} = \frac{1}{2}, \quad p_A = \frac{f_A}{f} = \frac{3}{4}, \quad p_B = \frac{f_B}{f} = \frac{1}{4}$$
$$p(A \to B) = \frac{p_{AB}}{p_A} = \frac{1}{3}, \quad p(B \to A) = \frac{p_{BA}}{p_B} = 1$$

The matrix $M$ (rows and columns ordered A, B) will be:

$$M = \begin{bmatrix} 2/3 & 1/3 \\ 1 & 0 \end{bmatrix}$$

The expected number of mutations is

$$\sum_X p_X (1 - M_{XX}) = p_A (1 - M_{AA}) + p_B (1 - M_{BB}) = \frac{3}{4}\left(1 - \frac{2}{3}\right) + \frac{1}{4}(1 - 0) = 0.5 = 50\%$$

The next step is the scaling of $M$ such that it is consistent with the definition of a 1-PAM matrix: 1 in 100 expected mutations. Suppose the elements $M_{ij}$ of matrix $M$ are scaled by a factor $\alpha$, so that the new values become $M'_{ij} = \alpha M_{ij}$. This changes the row sums to $\sum_j M'_{ij} = \alpha$. Since we want a probability matrix, in which every row sums to 1, a small adjustment is needed: we add $1 - \alpha$ to every element on the main diagonal:

$$M'_{ij} = \alpha M_{ij} \quad (i \neq j), \qquad M'_{ii} = \alpha M_{ii} + 1 - \alpha$$

This restores the property of a probability matrix. Now what should $\alpha$ be? Let us compute the new expected number of mutations:

$$\sum_i p_i (1 - M'_{ii}) = \sum_i p_i (1 - \alpha M_{ii} - 1 + \alpha) = \alpha \sum_i p_i (1 - M_{ii})$$
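The steps so far, applied to the toy alignment above, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`pam_from_columns`, `expected_mutations`, `scale`, `pam_scores`) are hypothetical, the alignment is passed as a list of aligned column pairs, and the scoring function assumes that no entry of the scaled matrix is zero.

```python
import math
from collections import Counter

def pam_from_columns(columns):
    """Estimate p[i] and M[i][j] = p_ij / p_i from ungapped aligned pairs."""
    f = Counter()
    for a, b in columns:
        f[(a, b)] += 1  # count both directions, as in the text
        f[(b, a)] += 1
    letters = sorted({x for pair in columns for x in pair})
    total = sum(f.values())
    f_i = {i: sum(f[(i, j)] for j in letters) for i in letters}
    p = {i: f_i[i] / total for i in letters}
    # p_ij / p_i simplifies to f_ij / f_i
    M = {i: {j: f[(i, j)] / f_i[i] for j in letters} for i in letters}
    return p, M

def expected_mutations(p, M):
    """Sum over i of p_i * (1 - M_ii)."""
    return sum(p[i] * (1.0 - M[i][i]) for i in p)

def scale(M, alpha):
    """Scale M by alpha and add 1 - alpha on the diagonal so rows still sum to 1."""
    return {i: {j: alpha * M[i][j] + (1.0 - alpha if i == j else 0.0)
                for j in M[i]} for i in M}

def pam_scores(M, p):
    """s_ij = 10 * log10(M_ij / p_j), rounded; assumes every M_ij > 0."""
    return {i: {j: round(10 * math.log10(M[i][j] / p[j])) for j in M[i]} for i in M}

# Toy alignment from the text: columns (A, A) and (B, A).
p, M = pam_from_columns([("A", "A"), ("B", "A")])
before = expected_mutations(p, M)   # expected mutations of the unscaled matrix
alpha = 0.01 / before               # target: 1 mutation per 100 residues
M1 = scale(M, alpha)
scores = pam_scores(M1, p)
after = expected_mutations(p, M1)
print(alpha, after)
```

The last line evaluates the new expected number of mutations, $\sum_i p_i (1 - M'_{ii})$, for the scaled matrix.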

This is just $\alpha$ multiplied by the old expected number of mutations. Therefore, we can set $\alpha$ appropriately. For instance, in the example above, the old expected number of mutations is 0.5, so $\alpha = 0.01 / 0.5 = 0.02$.

Having computed a 1-PAM matrix, the question is how to compute a 2-PAM matrix. In other words, what is the probability $p^{2}(i \to j)$ of $i$ mutating into $j$ in two units of evolution? This is the probability of $i$ mutating into $k$, for some $k$, in the first unit of evolution, and then $k$ mutating into $j$ in the second unit of evolution. Mathematically, this can be expressed as:

$$p^{2}(i \to j) = \sum_k p(i \to k)\, p(k \to j) = \sum_k M_{ik} M_{kj}$$

This is the formula for the entry corresponding to the pair $i$ and $j$ when multiplying $M$ by itself. Hence, the 2-PAM matrix is just $M^2$. An analogous argument shows that the k-PAM matrix is $M^k$. When working with a k-PAM probability matrix, the score is computed in the same way:

$$s^{k}_{ij} = 10 \log_{10} \frac{M^{k}_{ij}}{p_j}$$

The only change is that the values of $M^k$ are now plugged in instead of those of $M$.

BLOSUM (BLOCKS Substitution Matrices) matrices

As mentioned earlier, BLOSUM matrices are another type of matrix used in scoring sequence alignments. They are intended for scoring similarities of protein sequences that are evolutionarily far apart (distant). Their values are computed using the information stored in a database of blocks (called the BLOCKS database), where each block is a multiple ungapped alignment of related protein sequences. The sequences of each block are clustered, putting two sequences into the same cluster if their percentage of matching aligned residues, or level of similarity, is above a certain threshold L%. We define two sequences to be distant if they fall in different clusters. Therefore, two distant sequences differ by at least (100 - L)%. The computation of BLOSUM-L, for a particular value of L, is based on counting the number of mutations among distant sequences only. Therefore, lower values of L correspond to longer evolutionary times, and are applicable to more distant sequences.
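The clustering step can be sketched as follows. The text does not specify the clustering procedure in detail, so this sketch makes two assumptions that should be read as such: sequences are linked transitively (single linkage, implemented with a small union-find), and two sequences are linked when their identity is at least L%. The function names and the four short sequences are hypothetical.

```python
def percent_identity(x, y):
    """Percentage of aligned positions at which x and y carry the same residue."""
    assert len(x) == len(y), "ungapped block: sequences must have equal length"
    matches = sum(a == b for a, b in zip(x, y))
    return 100.0 * matches / len(x)

def cluster_block(seqs, L):
    """Group the sequences of one block; link two sequences when identity >= L%.

    Linking is transitive (single linkage), which is an assumption of this sketch.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= L:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(seqs[i])
    return list(clusters.values())

# Hypothetical block of four short ungapped sequences.
block = ["ACDE", "ACDF", "WCDF", "WYHK"]
print(cluster_block(block, 75))
print(cluster_block(block, 80))
```

Under single linkage, any two sequences that end up in different clusters are guaranteed to be below the threshold with respect to each other, which matches the definition of distant sequences given above; two sequences in the same cluster, however, may be linked only through intermediates.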
As explained above, in computing the entries of a BLOSUM-L matrix, we want to count the number of mutations between distant sequences only, that is, the ones that are less than L% similar. The value $f_{ab}$ is the relative frequency of seeing $a$ aligned with $b$. Whenever such an alignment is observed for two sequences that are in different clusters, $f_{ab}$ is incremented by $\frac{1}{n_1 n_2}$, where $n_1$ and $n_2$ are the sizes of the two clusters (we scale by the size of the clusters since larger clusters are more likely to contain mutations). The steps through which a matrix is computed are:

1. Estimate $p_i = \frac{\sum_j f_{ij}}{\sum_{k,l} f_{kl}}$.
2. Estimate $q_{ij} = \frac{f_{ij}}{\sum_{k,l} f_{kl}}$.
3. Set BLOSUM-L$(i,j) = \log \frac{q_{ij}}{p_i p_j}$, with some scaling factor $\lambda$.

Consider an example where sequences are generated at random (so we are not using the BLOCKS database here) such that $p_A = p_G = p_C = p_T = \frac{1}{4}$ and the level of similarity is 50%, i.e. the probability that two aligned residues are the same is 0.5. Then if L = 50%, we expect to have one cluster, where:

$$p_{AA} = p_{GG} = p_{CC} = p_{TT} = \frac{0.5}{4} = \frac{1}{8}$$

and

$$p_{AG} = p_{AC} = p_{AT} = p_{GA} = p_{GC} = p_{GT} = p_{CA} = p_{CG} = p_{CT} = p_{TA} = p_{TG} = p_{TC} = \frac{0.5}{12} = \frac{1}{24}$$

Then, taking logarithms to base 2, a match will have a score

$$m = \log_2 \frac{1/8}{1/4 \cdot 1/4} = 1$$

and a mismatch will have a score

$$s = \log_2 \frac{1/24}{1/4 \cdot 1/4} = -0.585$$
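The counting and scoring steps, together with the numbers in the example above, can be checked with the following sketch. Several things here are assumptions rather than part of the source: the function names, the choice to count each aligned pair in both directions (so that the counts are symmetric), and the use of base-2 logarithms, which is inferred from the example values. The counts `f` for the example are not observed data; they are chosen proportional to 1/8 per match and 1/24 per mismatch.

```python
import math

def add_cluster_pair(f, cluster1, cluster2):
    """Add the aligned pairs between two clusters, each weighted by 1/(n1*n2)."""
    w = 1.0 / (len(cluster1) * len(cluster2))
    for x in cluster1:
        for y in cluster2:
            for a, b in zip(x, y):
                # Counting both directions is an assumption of this sketch.
                f[(a, b)] = f.get((a, b), 0.0) + w
                f[(b, a)] = f.get((b, a), 0.0) + w

def blosum_from_counts(f, alphabet):
    """Turn pair counts f[(a, b)] into p, q and log2-odds scores.

    Assumes every pair has a nonzero count; a zero count would make the
    logarithm undefined and needs separate handling.
    """
    total = sum(f.get((k, l), 0.0) for k in alphabet for l in alphabet)
    q = {(i, j): f.get((i, j), 0.0) / total for i in alphabet for j in alphabet}
    p = {i: sum(q[(i, j)] for j in alphabet) for i in alphabet}
    s = {(i, j): math.log2(q[(i, j)] / (p[i] * p[j]))
         for i in alphabet for j in alphabet}
    return p, q, s

# Weighted counting on two hypothetical clusters of sizes 1 and 2.
counts = {}
add_cluster_pair(counts, ["AG"], ["AG", "AC"])

# Example from the text: counts proportional to 1/8 (match) and 1/24 (mismatch).
alphabet = "AGCT"
f = {(a, b): (3.0 if a == b else 1.0) for a in alphabet for b in alphabet}
p, q, s = blosum_from_counts(f, alphabet)
print(round(s[("A", "A")], 3), round(s[("A", "G")], 3))
```

In practice the scores would be multiplied by a scaling factor and rounded to integers; that step is left out here because the text does not fix the factor.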

References

Setubal J., Meidanis J., Introduction to Molecular Biology, Chapter.
Durbin R. et al., Biological Sequence Analysis, Chapter 2.