Likelihood vs. Information in Aligning Biopolymer Sequences. UCSD Technical Report CS Timothy L. Bailey

UCSD Technical Report CS93-318
Department of Computer Science and Engineering
University of California, San Diego
1 February, 1993

ABSTRACT: Biopolymer sequences often contain regions of similarity with other sequences due to homology or common function. A common method of discovering patterns in biopolymer sequences is to align a set of sequences so that certain columns of the alignment have highly non-random residue frequency distributions. The pattern can then be described in terms of a consensus pattern, motif, profile, specificity matrix or regular expression. This research note shows that a commonly used method of measuring the "goodness" of an alignment based on information theory is actually equivalent to maximizing the likelihood ratio of two hypotheses when the assumed probability distribution is multinomial. In addition, a method which has been used by other workers for determining whether a new sequence contains the pattern is shown to be essentially equivalent to a likelihood ratio. This offers a new, uniform way of thinking about the information contained in a set of aligned sequences which is more intuitive, and may aid the development of improved algorithms.

1 Introduction

It is useful to discover patterns in biopolymer sequences such as DNA, RNA or proteins for numerous reasons. The patterns may shed light on the structure and function of the sequences. The patterns may also be used for classifying new sequences as containing or not containing the pattern. Patterns in biopolymer sequences are known to exist that reflect common evolutionary origins of the sequences, common functions and common secondary and tertiary structure.

Patterns in biopolymer sequences can be discovered either by laboratory experiments or by examining sets of sequences known to share a common function, structure or evolutionary origin. Laboratory experiments are expensive, so looking for patterns in sets of sequences is an attractive alternative or supplement to lab work.
The patterns discovered can be directly informative about the biopolymers, can be used to classify new biopolymers and can be used to direct future laboratory experiments on biopolymers which appear to contain a pattern of interest.

For correspondence: Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0114, (619) 534-8187, tbailey@cs.ucsd.edu.

One method of examining sets of biopolymer sequences for potential common patterns is to align them either by hand or by computer and then look for columns which contain "unlikely" distributions of residues. Each sequence is treated as a string of letters over the appropriate alphabet (i.e., A, C, G, T for DNA sequences). The sequences are written horizontally and aligned so that the common regions in each string start in the same columns of the alignment. It may be necessary to insert gaps in some of the sequences in order to accomplish the alignment. Usually, the sequences are longer than the common pattern they share. For the purposes of this research note, it is assumed that the pattern being searched for is known or assumed to be W residues long.

The patterns discovered by aligning a set of sequences can be described as consensus patterns [Chappey et al., 1991], motifs [Staden, 1990], profiles [Gribskov et al., 1990], specificity matrices [Hertz et al., 1990] or regular expressions.

Hertz, Hartzell and Stormo [Hertz et al., 1990] describe a successful program which automatically aligns sets of sequences, produces a specificity matrix describing the discovered pattern and determines how well new sequences match the pattern. The program must score various possible alignments to determine which alignment is best. It uses what I will call an "alignment score". It also must determine if a new sequence matches the pattern represented by the optimum alignment. It computes what I call a "match score" and compares it to a threshold. The threshold is obtained by computing the match score for many sequences believed not to match the pattern, and choosing a number larger than the maximum match score thus found.

Hertz, Hartzell and Stormo's program uses an alignment score based on information theory. This score was first described in [Schneider et al., 1986]. The total alignment score is the sum of the score for each of the columns in the alignment window, where W, the length of the window, is chosen in advance.
$$\text{Alignment Score} = \sum_{\mathrm{col}=1}^{W} I(\mathrm{col})$$

The column alignment score I(col) is a measure of how unlikely the observed distribution of residues in a given column of the alignment window is. The alignment score for a single column is calculated as

$$I(\mathrm{col}) = \sum_{i=1}^{M} f_i \log_2 \frac{f_i}{p_i} \tag{1}$$

where $i$ is a residue, M is the number of different types of residues (i.e., M = 4 for DNA, M = 20 for proteins), $p_i$ is the genomic frequency of residue $i$ (i.e., the a priori estimate of the frequency of residue $i$), and $f_i$ is the frequency of residue $i$ in column col of the aligned sequences.

No derivation or motivation for the column alignment score I(col) is given in either [Schneider et al., 1986] or [Hertz et al., 1990]. Presumably its motivation is based on information theoretic arguments. It can be noted that I(col) is related to the relative entropy of two probability distributions for the residues in a column. In particular, if the residues are assumed to be equiprobable, that is, $p_i = 1/M$ for $1 \le i \le M$, then, using $\sum_{i=1}^{M} f_i = 1$,

$$
\begin{aligned}
I(\mathrm{col}) &= \sum_{i=1}^{M} f_i \log_2 \frac{f_i}{p_i} \\
&= \sum_{i=1}^{M} f_i \log_2 f_i - \sum_{i=1}^{M} f_i \log_2 p_i \\
&= \sum_{i=1}^{M} f_i \log_2 f_i - \sum_{i=1}^{M} f_i \log_2 (1/M) \\
&= \sum_{i=1}^{M} f_i \log_2 f_i + \log_2 M \\
&= H(1/M) - H(f)
\end{aligned}
$$

where $H(1/M) = \log_2 M$ is the entropy of a message with M equiprobable results and $H(f) = -\sum_{i=1}^{M} f_i \log_2 f_i$ is the entropy of a message with M results with probabilities $f_i$ for $1 \le i \le M$. It is not clear to this author what the meaning of I(col) is in terms of information theory when the a priori distribution is skewed (i.e., not $p_i = 1/M$ for $1 \le i \le M$). An attempt to reconstruct the motivation for I(col) led to the research for this note.

To evaluate the strength of the match between the pattern defined by an alignment and a new sequence of length W, [Hertz et al., 1990] use the sum of a match score for each column in the alignment window.

$$\text{Match Score} = \sum_{\mathrm{col}=1}^{W} \mathrm{Score}(\mathrm{col})$$

The column match score Score(col) measures how well the residue in a column of the new sequence matches the prediction made by the pattern discovered in the aligned sequences. The column match score for a new sequence which has residue $i$ in column col is

$$\mathrm{Score}(\mathrm{col}) = \log_2 \frac{n_i + 1}{(N + 1)\, p_i} \tag{2}$$

where N is the number of sequences being aligned, $n_i$ is the number of times residue $i$ occurred in column col of the alignment, and $p_i$ is the same as for I(col).

The motivation for the column match score Score(col) is given in [Hertz et al., 1990] in terms of how much the probability of the observed frequency distribution would change if the new sequence were added to the alignment. The probability of observing residue $i$ exactly $n_i$ times for $1 \le i \le M$ was assumed to be given by the multinomial distribution

$$P = \frac{N!}{\prod_{i=1}^{M} n_i!} \prod_{i=1}^{M} p_i^{\,n_i}$$

It will be shown that maximizing I(col) (resp. Score(col)) is equivalent to maximizing the log-likelihood ratio of two hypotheses given that the probability model is the multinomial distribution over N (resp. 1) independent trials. For I(col), the equivalence with a log-likelihood ratio maximization is exact; for Score(col) the equivalence is approximate, with the discrepancy becoming smaller as N, the number of sequences in the alignment, increases.
Section 2 will demonstrate the equivalence of the information-based and likelihood-based alignment and match scores. Section 3 discusses why the likelihood-based scores make intuitive sense and the implications for future research on algorithms for aligning biopolymer sequences.
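The two scores defined in this section can be illustrated with a minimal Python sketch. It is not the program of [Hertz et al., 1990]; the function names, the toy alignment window and the uniform background are assumptions chosen only for illustration, and the window is assumed to be gap-free.

```python
import math
from collections import Counter


def column_information(column, background):
    """I(col) = sum_i f_i * log2(f_i / p_i) for one alignment column.

    Residues absent from the column contribute nothing (0 * log 0 is taken as 0).
    """
    n_seqs = len(column)
    total = 0.0
    for residue, count in Counter(column).items():
        f = count / n_seqs
        total += f * math.log2(f / background[residue])
    return total


def alignment_score(window, background):
    """Sum I(col) over the W columns of a gap-free window of equal-length strings."""
    return sum(column_information(col, background) for col in zip(*window))


def column_match_score(column, residue, background):
    """Score(col) = log2((n_i + 1) / ((N + 1) * p_i)) for the new sequence's residue."""
    n_i = list(column).count(residue)
    return math.log2((n_i + 1) / ((len(column) + 1) * background[residue]))


def match_score(window, new_seq, background):
    """Sum Score(col) over the columns, pairing each column with new_seq's residue."""
    return sum(
        column_match_score(col, residue, background)
        for col, residue in zip(zip(*window), new_seq)
    )


# Hypothetical toy data: four aligned DNA sites and a uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
window = ["ACGT", "ACGA", "ACTT", "AGGT"]
print(alignment_score(window, uniform))
print(match_score(window, "ACGT", uniform))
```

With a uniform background, a fully conserved DNA column scores log2(4) = 2 bits under I(col), which agrees with H(1/M) - H(f) when H(f) = 0.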

2 Equivalence of Scores Based on Information and Likelihood

One method of choosing between two hypotheses given some observed data uses the concept of the likelihood ratio [Edwards, 1972]. In this method, you first choose a probability model that is assumed to describe the process that generates the data. Competing hypotheses are described in terms of parameters of the probability model. The object is to find the values of the parameters which are best supported by the data. The method is to choose the values of the parameters which would more frequently generate the observed data. This occurs when the value of the likelihood ratio is greater than 1.

The likelihood ratio is defined in terms of the likelihood function. The likelihood function for the multinomial distribution given some observed data R is

$$L(\theta \mid R) = k \prod_{i=1}^{M} P_i(\theta)^{\,n_i}$$

where $k$ is an arbitrary constant, M is the number of classes, $P_i(\theta)$ is the probability of a sample being in class $i$ on any given trial, and $n_i$ is the number of samples during N independent trials that belonged to class $i$. The likelihood ratio for hypothesis $\theta_1$ versus $\theta_2$ given data R is defined as

$$L(\theta_1, \theta_2 \mid R) = \frac{L(\theta_1 \mid R)}{L(\theta_2 \mid R)}$$

For the multinomial probability model and hypotheses $\theta_1$ and $\theta_2$ such that

$$P_i(\theta_1) = f_i, \quad 1 \le i \le M$$
$$P_i(\theta_2) = p_i, \quad 1 \le i \le M$$

the likelihood ratio can be written as

$$L(\theta_1, \theta_2 \mid R) = \frac{\prod_{i=1}^{M} f_i^{\,n_i}}{\prod_{i=1}^{M} p_i^{\,n_i}}$$
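As a numerical illustration of this ratio, the following sketch computes its base-2 logarithm directly from the residue counts of one column. Dividing by N (so that each count $n_i$ becomes a frequency $f_i$) gives the same number as the column score of equation (1). The function names, the example column and the skewed background frequencies are assumptions made for the example.

```python
import math
from collections import Counter


def log2_likelihood_ratio(column, background):
    """log2 of prod_i f_i^{n_i} / prod_i p_i^{n_i}, with f_i = n_i / N."""
    n_seqs = len(column)
    return sum(
        count * math.log2((count / n_seqs) / background[residue])
        for residue, count in Counter(column).items()
    )


def column_information(column, background):
    """I(col) = sum_i f_i * log2(f_i / p_i), as in equation (1)."""
    n_seqs = len(column)
    return sum(
        (count / n_seqs) * math.log2((count / n_seqs) / background[residue])
        for residue, count in Counter(column).items()
    )


# Hypothetical column of N = 8 residues and a skewed (non-uniform) background.
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
column = "CCCCCGGA"
ratio_per_trial = log2_likelihood_ratio(column, background) / len(column)
print(ratio_per_trial, column_information(column, background))
```

The two printed values agree up to floating-point rounding, and the agreement does not depend on the background being uniform.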

Theorem 1: Maximizing $L(\theta_1, \theta_2 \mid R)$ is equivalent to maximizing I(col), where the observed data R are the residues in the given column of the aligned sequences. (Notice that this data is also used to compute the values $n_i$ and $f_i$.)

Proof: Since $f(x) = x^{1/N}$ is monotonic, increasing for $x \ge 0$, and $L(\theta_1, \theta_2 \mid R) \ge 0$, maximizing $L(\theta_1, \theta_2 \mid R)$ is equivalent to maximizing

$$L(\theta_1, \theta_2 \mid R)^{1/N} = \left( \frac{\prod_{i=1}^{M} f_i^{\,n_i}}{\prod_{i=1}^{M} p_i^{\,n_i}} \right)^{1/N} = \frac{\prod_{i=1}^{M} f_i^{\,f_i}}{\prod_{i=1}^{M} p_i^{\,f_i}}$$

where the last step uses $n_i / N = f_i$. Since $\log(x)$ is monotonic, increasing, this is equivalent to maximizing

$$\log_2 \left( \frac{\prod_{i=1}^{M} f_i^{\,f_i}}{\prod_{i=1}^{M} p_i^{\,f_i}} \right) = \sum_{i=1}^{M} f_i \log_2 \frac{f_i}{p_i} = I(\mathrm{col})$$

Theorem 2: Maximizing $L(\theta_1, \theta_2 \mid R)$ is essentially equivalent to maximizing Score(col), where the observed data R is the single residue that the new sequence has in the given column.

Proof: Once again, we take the logarithm of the likelihood ratio and note that maximizing the log-likelihood ratio is equivalent to maximizing the likelihood ratio. For a single trial in which residue $i$ is observed, the ratio reduces to $f_i / p_i$, so

$$\log_2 L(\theta_1, \theta_2 \mid R) = \log_2 \frac{f_i}{p_i} \approx \log_2 \frac{n_i + 1}{(N + 1)\, p_i} = \mathrm{Score}(\mathrm{col})$$

The approximation becomes better as N increases, since as $N \to \infty$, Score(col) approaches $\log_2 L(\theta_1, \theta_2 \mid R)$ because $(n_i + 1)/(N + 1) \to f_i$.

3 Discussion

This research note has shown that a successful scoring system [Hertz et al., 1990] for alignments of related biopolymer sequences based on information theory is equivalent to a likelihood ratio method. Also, a method of using a set of aligned sequences to evaluate whether new sequences contain the same pattern based on a probabilistic argument is equivalent to the same likelihood ratio method.

The likelihood ratio method has the advantages of making all of the assumptions upon which an inference is based explicit, and of being intuitively pleasing (at least to some). It requires that the probability model and the alternative hypotheses to be tested be clearly defined. The data is then used to determine which hypothesis is better supported. The likelihood ratio can be interpreted operationally as the relative frequency with which the observed data would be generated by the two hypotheses [Edwards, 1972]. Using a single method for justifying both the alignment and matching processes seems simpler and more open to analysis than a combination of information theory and probability.

The probability model used in this note for the frequencies of residues in a column of a set of aligned sequences is the multinomial distribution. This is a sensible model since each of the N sequences being aligned can be thought of as an independent sample. The two hypotheses which are compared in the likelihood ratio in this note are the hypothesis that the residue probabilities in the columns of the correctly aligned sequences are the observed residue frequencies, versus the hypothesis that the correct probabilities are the a priori frequencies. Searching for the alignment with the highest likelihood ratio can be viewed as looking for the alignment such that the second hypothesis is rejected most strongly. This is a reasonable way of determining if a pattern really exists in the data that can be found by trying various alignments.

The method of aligning sequences described in [Hertz et al., 1990] summed the scores for all the columns in the alignment window. Since the scores are equivalent to log-likelihood ratios, this is equivalent to multiplying likelihood ratios together. Future research should examine the assumptions of independence among the columns of the alignment underlying this algorithm. It might also be useful to replace Score(col) with the log-likelihood ratio in cases where there are few sequences, since that is when they will differ the most.

It would be interesting to analytically define the distributions of I(col) and Score(col) in order to set thresholds for alignments and matches without resorting to large sets of supposed negative examples. This might be easier using the likelihood formalism than using the information theoretic and probabilistic formalisms of [Hertz et al., 1990].
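The remark that Score(col) and the log-likelihood ratio differ most when there are few sequences can be checked numerically. The sketch below is only an illustration under assumed values: a residue that occupies half of a column, a background frequency of 0.25, and hypothetical function names. It compares the two quantities for growing N.

```python
import math


def match_score(n_i, n_seqs, p_i):
    """Score(col) = log2((n_i + 1) / ((N + 1) * p_i))."""
    return math.log2((n_i + 1) / ((n_seqs + 1) * p_i))


def log2_ratio(n_i, n_seqs, p_i):
    """Single-trial log-likelihood ratio log2(f_i / p_i); requires n_i > 0."""
    return math.log2((n_i / n_seqs) / p_i)


def discrepancy(n_seqs, fraction=0.5, p_i=0.25):
    """Absolute gap between the two scores when the residue fills `fraction` of the column."""
    n_i = round(fraction * n_seqs)
    return abs(match_score(n_i, n_seqs, p_i) - log2_ratio(n_i, n_seqs, p_i))


# Assumed alignment sizes; the gap shrinks as the number of sequences grows.
for n_seqs in (4, 40, 400):
    print(n_seqs, discrepancy(n_seqs))
```

Note that when a residue never occurs in the column ($n_i = 0$) the log-likelihood ratio is undefined (minus infinity), while Score(col) stays finite, so any replacement of one by the other would have to treat that case separately.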

References

[Chappey et al., 1991] C. Chappey, A. Danckaert, P. Dessen, and S. Hazout. MASH: An interactive program for multiple alignment and consensus sequence construction for biological sequences. Computer Applications in Biosciences, 7(2):195-202, 1991.

[Edwards, 1972] A. W. F. Edwards. Likelihood. Cambridge University Press, Cambridge, England, 1972.

[Gribskov et al., 1990] Michael Gribskov, Roland Lüthy, and David Eisenberg. Profile analysis. Methods in Enzymology, 183:146-159, 1990.

[Hertz et al., 1990] Gerald Z. Hertz, George W. Hartzell, III, and Gary D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in Biosciences, 6(2):81-92, 1990.

[Schneider et al., 1986] Thomas D. Schneider, Gary D. Stormo, Larry Gold, and Andrzej Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 188:415-431, 1986.

[Staden, 1990] Rodger Staden. Searching for patterns in protein and nucleic acid sequences. Methods in Enzymology, 183:193-210, 1990.