Learning to Align Sequences: A Maximum-Margin Approach


Thorsten Joachims
Department of Computer Science
Cornell University
Ithaca, NY 14853
tj@cs.cornell.edu

August 28, 2003

Abstract

We propose a discriminative method for learning the parameters (e.g. cost of substitutions, deletions, insertions) of linear sequence alignment models from training examples. While the resulting training problem leads to an optimization problem with an exponential number of constraints, we present a simple algorithm that finds an arbitrarily close approximation after considering only a subset of the constraints that is linear in the number of training examples and polynomial in the length of the sequences. We also evaluate empirically that the method effectively learns good parameter values while being computationally feasible.

1 Introduction

Methods for sequence alignment are common tools for analyzing sequence data ranging from biological applications [3] to natural language processing [1]. They can be thought of as measures of similarity between sequences, where the similarity score is the result of a discrete optimization problem that is typically solved via dynamic programming. While the dynamic programming algorithm determines the general notion of similarity (e.g. local alignment vs. global alignment), any such similarity measure requires specific parameter values before it is fully specified. Examples of such parameter values are the costs for substituting one sequence element for another, as well as the costs for deletions and insertions. These parameter values greatly determine how well the measure works for a particular task.

In this paper we tackle the problem of inferring the parameter values from training data. Our goal is to find parameter values so that the resulting similarity measure best reflects the desired notion of similarity. We consider training data where we have examples of similar and dissimilar sequences. Instead of assuming a generative model of sequence alignment (e.g. [9]), we take a discriminative approach to training. In particular, we aim to find the set of parameter values that corresponds to the best similarity measure a given alignment algorithm can represent. Taking a large-margin approach, we show that we can solve the resulting training problem efficiently for a large class of alignment algorithms that implement a linear scoring function. While the resulting optimization problems have exponentially many constraints, our algorithm finds an arbitrarily good approximation after considering only a subset of constraints that scales polynomially with the length of the sequences and linearly with the number of training examples. We empirically and theoretically analyze the scaling of the algorithm and show that the learned similarity score performs well on test data.

2 Sequence Alignment

Sequence alignment computes a similarity score for two (or more) sequences $s_1$ and $s_2$ from an alphabet $\Sigma = \{1, \dots, \sigma\}$. An alignment $a$ is a sequence of operations that transforms one sequence into the other. In global alignment, the whole sequence is transformed. In local alignment, only an arbitrarily sized subsequence is aligned. Commonly used alignment operations are match (m), substitution (s), deletion (d) and insertion (i). An example of a local alignment is given in Figure 1.

[Figure 1: Example of a local sequence alignment.]

In the example, there are 6 matches, 1 substitution, and 1 insertion/deletion. With each operation there is an associated cost/reward. Assuming a reward of 3 for a match, a cost of 1 for a substitution, and a cost of 2 for an insertion/deletion, the total alignment score $D_{\vec w}(s_1, s_2, a)$ in the example is 15. The optimal alignment $a^*$ is the one that maximizes the score for a given cost model.

More generally, we consider alignment algorithms that optimize a linear scoring function

$$D_{\vec w}(s_1, s_2, a) = \vec w^T \Phi(s_1, s_2, a)$$

where $\Phi(s_1, s_2, a)$ is some function that generates features based on an alignment $a$ (e.g. number of substitutions, number of deletions) and $\vec w$ is a given cost vector. $\Phi(s_1, s_2, a)$ depends on the particular alignment algorithm and can be any feature vector. Finding the optimal alignment corresponds to the following optimization problem:

$$D_{\vec w}(s_1, s_2) = \max_a \left[ D_{\vec w}(s_1, s_2, a) \right] = \max_a \left[ \vec w^T \Phi(s_1, s_2, a) \right]. \quad (1)$$

This type of problem is typically solved via dynamic programming. In the following we consider local alignment via the Smith/Waterman algorithm [10]. However, the results can be extended to any alignment algorithm that optimizes a linear scoring function and that solves (1) globally optimally. This also holds for other structures besides sequences.
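To make the linear scoring function concrete, here is a minimal sketch of Smith/Waterman-style local alignment under a 3-parameter cost model (match, substitution, insertion/deletion), where rewards are positive and costs negative entries of $\vec w$. It returns both the optimal score $D_{\vec w}(s_1, s_2)$ and the feature vector $\Phi$ of the maximizing alignment. The function name and the 3-feature representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def smith_waterman(s1, s2, w):
    """Best local-alignment score max_a w . Phi(s1, s2, a), plus the
    feature vector Phi = (#matches, #substitutions, #indels) of the
    maximizing alignment, for w = (w_match, w_subst, w_indel)."""
    n, m = len(s1), len(s2)
    H = np.zeros((n + 1, m + 1))          # best local score ending at (i, j)
    phi = np.zeros((n + 1, m + 1, 3))     # feature counts of that alignment
    best, best_ij = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            op = 0 if s1[i - 1] == s2[j - 1] else 1   # match vs. substitution
            cands = [
                (0.0, None, None),                    # start a new alignment
                (H[i - 1, j - 1] + w[op], (i - 1, j - 1), op),
                (H[i - 1, j] + w[2], (i - 1, j), 2),  # gap in s2
                (H[i, j - 1] + w[2], (i, j - 1), 2),  # gap in s1
            ]
            score, prev, o = max(cands, key=lambda c: c[0])
            H[i, j] = score
            if prev is not None:
                phi[i, j] = phi[prev]
                phi[i, j, o] += 1
            else:
                phi[i, j] = 0.0
            if score > best:
                best, best_ij = score, (i, j)
    return best, phi[best_ij]

# Reward 3 for a match, cost 1 for a substitution, cost 2 for an indel,
# as in the worked example above (the sequences here are arbitrary).
score, features = smith_waterman("AACGTACGT", "ACGTCCGT", np.array([3.0, -1.0, -2.0]))
print(score, features)    # the score equals w . features
```

Note that for each fixed alignment $a$ the score is linear in $\vec w$; this is the property the training algorithm below exploits.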

3 Inverse Sequence Alignment

Inverse sequence alignment is the problem of using training data to learn the parameters $\vec w$ of an alignment model and algorithm so that the resulting similarity measure $D_{\vec w}(s_1, s_2)$ best represents the desired notion of similarity on new data. While previous approaches to this problem exist [5, 11], they are limited to special cases and very small numbers of parameters. We will present an algorithm that applies to any linear alignment model, with no restriction on the function or the number of parameters.

We assume the following model, for which the notation is inspired by protein alignment. In protein alignment the goal is a similarity measure so that homologous protein sequences are similar to a native sequence, but also so that non-homologous (i.e. decoy) sequences are not similar to the native sequence. We assume that examples are generated i.i.d. according to a distribution $P(s^N, s^H, S^D)$. $s^N$ is the native sequence, $s^H$ the homologous sequence, and $S^D$ is a set of decoy sequences $s^{D_1}, \dots, s^{D_d}$. The goal is a similarity measure so that the native sequence $s^N$ and the homolog $s^H$ are more similar than the native sequence $s^N$ and any decoy $s^{D_j}$, i.e.

$$D_{\vec w}(s^N, s^H) > D_{\vec w}(s^N, s^{D_j}). \quad (2)$$

The goal of learning can be formulated in two reasonable ways based on two different loss functions. In the first setting, the goal is to find the cost parameters $\vec w$ that minimize the probability $Err_P(\vec w)$ that the similarity with any decoy sequence $D_{\vec w}(s^N, s^{D_j})$ is higher than the similarity with the homologous sequence $D_{\vec w}(s^N, s^H)$:

$$Err_P(\vec w) = \int \mathbf{1}\!\left\{ \max_{s^{D_j} \in S^D} D_{\vec w}(s^N, s^{D_j}) < D_{\vec w}(s^N, s^H) \right\} dP(s^N, s^H, S^D) \quad (3)$$

The function $\mathbf{1}\{\cdot\}$ returns 1 if its argument is false, and 0 otherwise.

In the second setting, the goal is less ambitious. We do not require a homolog that is necessarily more similar than all decoys. Instead, we merely want the homolog sequence ranked highly, but not necessarily at the top of the ranking. In particular, we could optimize the average rank of the homolog. This is equivalent to the following kind of error rate:

$$Err_P^{rk}(\vec w) = \int \sum_{s^{D_j} \in S^D} \mathbf{1}\!\left\{ D_{\vec w}(s^N, s^{D_j}) < D_{\vec w}(s^N, s^H) \right\} dP(s^N, s^H, S^D) \quad (4)$$

In the following, we consider only the first type of error, $Err_P(\vec w)$, for conciseness.
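As a concrete illustration of the two loss functions, the following sketch computes their empirical analogues from precomputed similarity scores. All names are illustrative; in practice the scores would come from the alignment model $D_{\vec w}$.

```python
# d_hom[i] = D_w(s_i^N, s_i^H); d_dec[i] = list of D_w(s_i^N, s^Dj) over decoys.

def zero_one_error(d_hom, d_dec):
    """Fraction of natives where the homolog fails to beat every decoy
    (the empirical analogue of Err_P)."""
    return sum(max(ds) >= dh for dh, ds in zip(d_hom, d_dec)) / len(d_hom)

def rank_error(d_hom, d_dec):
    """Average number of decoys scoring at least as high as the homolog
    (the empirical analogue of the rank-based error)."""
    return sum(sum(d >= dh for d in ds) for dh, ds in zip(d_hom, d_dec)) / len(d_hom)

print(zero_one_error([10, 4], [[7, 9, 3], [2, 1, 5]]))  # 0.5 (second native errs)
print(rank_error([10, 4], [[7, 9, 3], [2, 1, 5]]))      # 0.5 (one decoy >= homolog)
```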

4 A Maximum-Margin Approach to Learning the Cost Parameters

The distribution $P(s^N, s^H, S^D)$ is unknown. However, we have a training sample $S$ drawn from $P(s^N, s^H, S^D)$. It consists of native sequences $s_1^N, \dots, s_n^N$, homolog sequences $s_1^H, \dots, s_n^H$, and a set of decoy sequences $S_1^D, \dots, S_n^D$ for each native $s_1^N, \dots, s_n^N$. As a simplifying assumption, we assume that the alignments $a_1^{NH}, \dots, a_n^{NH}$ of maximum score between native and homolog sequences are known. (For protein alignment, for example, these could be generated via structural alignment.) The goal is to find an optimal $\vec w$ so that the error rate $Err_P(\vec w)$ is low.

First, we consider the case where there exists a $\vec w$ such that the error on the training set

$$Err_S(\vec w) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left\{ \max_{s^{D_j} \in S_i^D} D_{\vec w}(s_i^N, s^{D_j}) < D_{\vec w}(s_i^N, s_i^H) \right\} \quad (5)$$

is zero. Since we assume a scoring function that is linear in the parameters,

$$D_{\vec w}(s_1, s_2, a) = \vec w^T \Phi(s_1, s_2, a), \quad (6)$$

the condition of zero training error can be written as a set of linear inequality constraints. There is one constraint for each combination of native sequence $s_i^N$, decoy sequence $s^{D_j}$, and possible alignment $a$ of $s_i^N$ into $s^{D_j}$:

$$\forall s^{D_j} \in S_1^D \;\; \forall a: \; D_{\vec w}(s_1^N, s^{D_j}, a) < D_{\vec w}(s_1^N, s_1^H, a_1^{NH})$$
$$\dots \quad (7)$$
$$\forall s^{D_j} \in S_n^D \;\; \forall a: \; D_{\vec w}(s_n^N, s^{D_j}, a) < D_{\vec w}(s_n^N, s_n^H, a_n^{NH})$$

This approach of writing the training problem as a linear system follows the method proposed in [8] for the special case of global alignment without free insertions/deletions. However, in the general case of Equation (7) the number of constraints is exponential, since the number of alignments $a$ between $s_i^N$ and $s^{D_j}$ is exponential in the lengths of $s_i^N$ and $s^{D_j}$. Unlike the restricted case in [8], standard optimization algorithms cannot handle a problem of this size. To overcome this limitation, in Section 5 we will propose an algorithm that exploits the special structure of Equation (7) so that it needs to examine only a subset of constraints that is at most linear in the number of training examples $n$.

If the set of inequalities in Equation (7) is feasible, there will typically be more than one solution $\vec w^*$. To specify a unique solution, we select the $\vec w$ for which the similarity of the homolog $D_{\vec w}(s_i^N, s_i^H, a_i^{NH})$ is uniformly most different from the similarity $\max_{s^{D_j} \in S_i^D} D_{\vec w}(s_i^N, s^{D_j})$ of the best alignment with any decoy. This corresponds to the maximum-margin principle employed in Support Vector Machines [2]. Denoting the margin by $\delta$ and restricting the $L_2$ norm of $\vec w$ to make the problem well-posed, this leads to the following optimization problem:

$$\max_{\vec w} \; \tfrac{1}{2}\delta^2 \quad (8)$$
$$\forall s^{D_j} \in S_1^D \;\; \forall a: \; D_{\vec w}(s_1^N, s^{D_j}, a) \le D_{\vec w}(s_1^N, s_1^H, a_1^{NH}) - \delta \quad (9)$$
$$\dots \quad (10)$$
$$\forall s^{D_j} \in S_n^D \;\; \forall a: \; D_{\vec w}(s_n^N, s^{D_j}, a) \le D_{\vec w}(s_n^N, s_n^H, a_n^{NH}) - \delta \quad (11)$$
$$||\vec w|| = 1 \quad (12)$$

Due to the linearity of the similarity function (6), the length of $\vec w$ is a free variable and we can fix it to $1/\delta$. Substituting for $\delta$ and rearranging leads to the equivalent optimization problem

$$\min_{\vec w} \; \tfrac{1}{2} \vec w^T \vec w \quad (13)$$
$$\forall s^{D_j} \in S_1^D \;\; \forall a: \; \vec w^T \left[ \Phi(s_1^N, s_1^H, a_1^{NH}) - \Phi(s_1^N, s^{D_j}, a) \right] \ge 1 \quad (14)$$
$$\dots \quad (15)$$
$$\forall s^{D_j} \in S_n^D \;\; \forall a: \; \vec w^T \left[ \Phi(s_n^N, s_n^H, a_n^{NH}) - \Phi(s_n^N, s^{D_j}, a) \right] \ge 1 \quad (16)$$
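The rescaling step from (8)-(12) to (13)-(16) can be spelled out in one line; the following is a sketch of that substitution, writing $\Delta\Phi$ for the feature difference between the homolog alignment and a decoy alignment.

```latex
% With \Delta\Phi = \Phi(s_i^N, s_i^H, a_i^{NH}) - \Phi(s_i^N, s^{D_j}, a),
% each constraint (9)-(11) reads \vec{w}^T \Delta\Phi \ge \delta with \|\vec{w}\| = 1.
% Substituting \vec{w}' = \vec{w}/\delta (so \|\vec{w}'\| = 1/\delta) gives
\begin{align*}
  \vec{w}^T \Delta\Phi \ge \delta, \;\; \|\vec{w}\| = 1
  \quad\Longleftrightarrow\quad
  \vec{w}'^T \Delta\Phi \ge 1, \;\; \|\vec{w}'\| = \tfrac{1}{\delta},
\end{align*}
% so maximizing \delta is equivalent to minimizing
% \tfrac{1}{2}\vec{w}'^T\vec{w}' = \tfrac{1}{2\delta^2}.
```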

Since this quadratic program (QP) has a positive-definite objective function and (feasible) linear constraints, it is strictly convex. This means it has a unique global minimum and no local minima [4]. The constraints are similar to the ordinal regression approach in [6], and the problem has a structure similar to the Ranking SVM described in [7] for information retrieval. However, the number of constraints is much larger.

To allow errors in the training set, we introduce slack variables $\xi_i$ [2]. Corresponding to the error measure $Err_P(\vec w)$, we have one slack variable for each native sequence. This is different from a normal classification or regression SVM, where there is a different slack variable for each constraint. The slacks enter the objective function according to a trade-off parameter $C$. For simplicity, we consider only the case where the slacks enter the objective function squared:

$$\min_{\vec w, \vec \xi} \; \tfrac{1}{2} \vec w^T \vec w + C \sum_{i=1}^{n} \xi_i^2 \quad (17)$$
$$\forall s^{D_j} \in S_1^D \;\; \forall a: \; \vec w^T \left[ \Phi(s_1^N, s_1^H, a_1^{NH}) - \Phi(s_1^N, s^{D_j}, a) \right] \ge 1 - \xi_1 \quad (18)$$
$$\dots \quad (19)$$
$$\forall s^{D_j} \in S_n^D \;\; \forall a: \; \vec w^T \left[ \Phi(s_n^N, s_n^H, a_n^{NH}) - \Phi(s_n^N, s^{D_j}, a) \right] \ge 1 - \xi_n \quad (20)$$

Note that $\sum_{i=1}^{n} \xi_i^2$ is an upper bound on the training error $Err_S(\vec w)$. Therefore, the algorithm minimizes the training error $Err_S(\vec w)$ while maximizing the margin.
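For a finite working set of constraints, the restricted version of the QP (17)-(20) is small enough for off-the-shelf solvers. The following is a hedged sketch using the cvxpy modeling library (not the paper's implementation); each constraint is stored as the feature-difference vector it was derived from, together with the index of its native sequence.

```python
import numpy as np
import cvxpy as cp

def solve_restricted_qp(working_set, n, d, C):
    """Solve (17)-(20) restricted to working_set, a list of (i, dphi)
    pairs with dphi = Phi(native_i, homolog_i, a_i^NH) - Phi(native_i, decoy, a).
    n: number of training examples; d: number of parameters."""
    w = cp.Variable(d)
    xi = cp.Variable(n)   # with squared slacks, xi >= 0 holds at the optimum
    constraints = [dphi @ w >= 1 - xi[i] for i, dphi in working_set]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum_squares(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value
```

Omitting explicit nonnegativity constraints on the slacks is a standard simplification for the squared-slack formulation: a negative $\xi_i$ can never lower the objective.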

5 Training Algorithm

Due to the exponential number of constraints, standard optimization software cannot handle problems of interesting size. However, by exploiting the special structure of the problem, we propose the following algorithm that finds the solution of (17)-(20) after examining only a small number of constraints. The algorithm proceeds by greedily adding constraints from (18)-(20) to a working set $K$. The algorithm stops when all constraints in (18)-(20) are fulfilled up to a precision of $\epsilon$.

Input: native sequences $s_1^N, \dots, s_n^N$, homolog sequences $s_1^H, \dots, s_n^H$, alignments $a_1^{NH}, \dots, a_n^{NH}$, sets of decoy sequences $S_1^D, \dots, S_n^D$, tolerated approximation error $\epsilon \ge 0$.

    K = {}, w = 0, xi = 0
    repeat
        K_org = K
        for i from 1 to n
            for j from 1 to |S_i^D|
                find a* = argmax_a w^T Phi(s_i^N, s^{D_j}, a) via dynamic programming
                if w^T (Phi(s_i^N, s_i^H, a_i^NH) - Phi(s_i^N, s^{D_j}, a*)) < 1 - xi_i - epsilon
                    K = K u { w^T (Phi(s_i^N, s_i^H, a_i^NH) - Phi(s_i^N, s^{D_j}, a*)) >= 1 - xi_i }
                    solve QP: (w, xi) = argmin_{w, xi} 1/2 w^T w + C sum_i xi_i^2 subject to K
    until (K = K_org)
    Output: w

The following two theorems show that the algorithm returns the correct solution and that it stops after $O(n)$ iterations through the repeat-loop.

Theorem 1 (CORRECTNESS) The algorithm returns an approximation to (17)-(20) that has an objective value not higher than the solution of (17)-(20), and that fulfills all constraints up to a precision of $\epsilon$. For $\epsilon = 0$, the algorithm returns the exact solution $(\vec w^*, \vec \xi^*)$ of (17)-(20).

Proof. Let $\vec w^*$ and $\vec \xi^*$ be the solution of (17)-(20). Since the algorithm solves the QP on a subset of the constraints in each iteration, it must return a solution $\vec w$ with $\tfrac{1}{2}\vec w^T \vec w + C \sum_{i=1}^{n} \xi_i^2 \le \tfrac{1}{2}\vec w^{*T} \vec w^* + C \sum_{i=1}^{n} \xi_i^{*2}$. This follows from the fact that decreasing the feasible region cannot lead to a lower minimum.

It is left to show that the algorithm does not terminate before all constraints (18)-(20) are fulfilled up to precision $\epsilon$. In the final iteration, the algorithm finds the highest scoring alignment $a^*$ for each decoy and checks whether the constraint $\vec w^T (\Phi(s_i^N, s_i^H, a_i^{NH}) - \Phi(s_i^N, s^{D_j}, a^*)) \ge 1 - \xi_i - \epsilon$ is fulfilled. Three cases can occur: First, the constraint can be violated, in which case the algorithm will not terminate yet. Second, the constraint is already in $K$ and is fulfilled by construction. Third, the constraint is not in $K$ but is fulfilled anyway, and it is not added. If this constraint is fulfilled, the constraints for all other alignments into this decoy are fulfilled as well, since we checked the constraint for which the margin is smallest for the given $\vec w$. It follows that the algorithm terminates only if all constraints are fulfilled up to precision $\epsilon$.

Similarly, it can be shown that for $\epsilon > 0$ the algorithm finds a solution such that the KKT-conditions of the Wolfe dual of (17)-(20) are fulfilled up to a precision of $\epsilon$. We omit the proof for brevity.
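Putting the pieces together, here is a sketch of the constraint-generation loop from the algorithm box, reusing the illustrative smith_waterman and solve_restricted_qp helpers sketched above (so this is a toy instance under those assumptions, not the paper's code). phi_nh[i] is assumed to hold $\Phi(s_i^N, s_i^H, a_i^{NH})$ for the given native-homolog alignments.

```python
import numpy as np

def train_alignment_model(natives, phi_nh, decoy_sets, d=3, C=0.01, eps=0.1):
    n = len(natives)
    w, xi = np.zeros(d), np.zeros(n)
    working_set = []                          # the constraint set K
    while True:
        added = False                         # tracks whether K changed (K == K_org)
        for i in range(n):
            for decoy in decoy_sets[i]:
                # a* = argmax_a w . Phi(s_i^N, decoy, a), via dynamic programming
                _, phi_star = smith_waterman(natives[i], decoy, w)
                dphi = phi_nh[i] - phi_star
                if w @ dphi < 1 - xi[i] - eps:    # violated by more than eps
                    working_set.append((i, dphi))
                    w, xi = solve_restricted_qp(working_set, n, d, C)
                    added = True
        if not added:                         # all constraints eps-satisfied
            return w
```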

Theorem 2 (TERMINATION) The algorithm stops after adding at most

$$\frac{2 V R^2}{\epsilon^2} \quad (21)$$

constraints to the set $K$, where $V$ is the minimum of (17)-(20) and $R^2$ is a constant bounded by the maximum of $(\Phi(s_i^N, s_i^H, a_i^{NH}) - \Phi(s_i^N, s^{D_j}, a))^2 + \tfrac{1}{2C}$.

Proof. The first part of the proof is to show that the objective value increases by some constant with every constraint that is added to $K$. Denote by $V_k$ the solution $V_k = P(\vec w_k^*, \vec \xi_k^*) = \min_{\vec w, \vec \xi} \tfrac{1}{2}\vec w^T \vec w + C \sum_{i=1}^{n} \xi_i^2$ subject to $K_k$ after adding $k$ constraints. This primal optimization problem can be transformed into an equivalent problem of the form $V_k = P(\vec w_k'^*) = \min_{\vec w'} \tfrac{1}{2}\vec w'^T \vec w'$ subject to $K_k'$, where each constraint has the form $\vec w'^T \vec x \ge 1$ with

$$\vec x = \left( \Phi(s_i^N, s_i^H, a_i^{NH}) - \Phi(s_i^N, s^{D_j}, a^*); \; 0, \dots, 0, \tfrac{1}{\sqrt{2C}}, 0, \dots, 0 \right).$$

Its corresponding Wolfe dual is

$$D(\vec \alpha_k^*) = \max_{\vec \alpha \ge 0} \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j \vec x_i \vec x_j.$$

At the solution, $D(\vec \alpha_k^*) = P(\vec w_k'^*) = P(\vec w_k^*, \vec \xi_k^*) = V_k$, and for every feasible point $D(\vec \alpha) \le P(\vec w, \vec \xi)$. Primal and dual are connected via $\vec w_k'^* = \sum_{i=1}^{k} \alpha_i^* \vec x_i$. Adding a constraint to the dual with $\vec w_k'^{*T} \vec x_{k+1} = \sum_{i=1}^{k} \alpha_i^* \vec x_i \vec x_{k+1} \le 1 - \epsilon$ means extending the dual to

$$D_{k+1}(\vec \alpha_{k+1}^*) = \max_{\vec \alpha \ge 0, \, \alpha_{k+1} \ge 0} \sum_{i=1}^{k+1} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j \vec x_i \vec x_j - \alpha_{k+1} \sum_{i=1}^{k} \alpha_i \vec x_i \vec x_{k+1} - \frac{1}{2} \alpha_{k+1}^2 \vec x_{k+1}^2$$
$$\ge D_k(\vec \alpha_k^*) + \max_{\alpha_{k+1} \ge 0} \left[ \alpha_{k+1} - \alpha_{k+1} \sum_{i=1}^{k} \alpha_i^* \vec x_i \vec x_{k+1} - \frac{1}{2} \alpha_{k+1}^2 \vec x_{k+1}^2 \right]$$
$$\ge D_k(\vec \alpha_k^*) + \max_{\alpha_{k+1} \ge 0} \left[ \alpha_{k+1} \epsilon - \frac{1}{2} \alpha_{k+1}^2 \vec x_{k+1}^2 \right]$$

Solving the remaining scalar optimization problem over $\alpha_{k+1}$ shows that $\alpha_{k+1}^* \ge 0$ and that $V_{k+1} \ge V_k + \frac{\epsilon^2}{2R^2}$.

Since the algorithm only adds constraints that are violated by the current solution by more than $\epsilon$, after adding $k_{max} = \frac{2VR^2}{\epsilon^2}$ constraints the solution $V_{k_{max}}$ over the subset $K_{k_{max}}$ is at least $V_{k_{max}} \ge V_0 + \frac{2VR^2}{\epsilon^2} \cdot \frac{\epsilon^2}{2R^2} = 0 + V$. Any additional constraint that is violated by more than $\epsilon$ would lead to a minimum that is larger than $V$. Since the minimum over a subset of constraints can only be smaller than the minimum over all constraints, there cannot be any more constraints violated by more than $\epsilon$, and the algorithm stops.

The theorem directly leads to the conclusion that the maximum number of constraints in $K$ scales linearly with the number of training examples $n$, since $V$ can be upper bounded as $V \le C \, n$ using the feasible point $\vec w = 0$ and $\vec \xi = \vec 1$ in (17)-(20). Furthermore, it scales only polynomially with the length of the sequences, since $R$ is polynomial in the length of the sequences.

While the number of constraints can potentially explode for small values of $\epsilon$, experience with Support Vector Machines for classification shows that relatively large values of $\epsilon$ are sufficient without loss of generalization performance. We will verify the efficiency and the prediction performance of the algorithm empirically in the following.
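For completeness, the scalar maximization at the end of the first part of the proof works out as follows.

```latex
% The concave scalar problem over \alpha_{k+1} is maximized at
% \alpha_{k+1}^* = \epsilon / \vec{x}_{k+1}^2 \ge 0, giving
\begin{align*}
  \max_{\alpha_{k+1} \ge 0} \left[ \alpha_{k+1}\epsilon
      - \tfrac{1}{2}\alpha_{k+1}^2 \vec{x}_{k+1}^2 \right]
  = \frac{\epsilon^2}{2\,\vec{x}_{k+1}^2}
  \;\ge\; \frac{\epsilon^2}{2R^2},
\end{align*}
% since \vec{x}_{k+1}^2 \le R^2, so V_{k+1} \ge V_k + \epsilon^2 / (2R^2).
```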

6 Experiments

To analyze the behavior of the algorithm under varying conditions, we constructed a synthetic dataset according to the following sequence and alignment model. The native sequence and the decoys are generated by drawing randomly from a 20-letter alphabet $\Sigma = \{1, \dots, 20\}$ so that letter $c \in \Sigma$ has probability $c/210$. Each sequence has length 50, and there are 10 decoys per native. To generate the homolog, we generate an alignment string of length 30 consisting of the 4 characters "match", "substitute", "insert", "delete". For simplicity of illustration, substitutions are always $c \rightarrow (c \bmod 20) + 1$. While we experimented with several alignment models, we only report typical results here, where matches occur with probability 0.2, substitutions with 0.4, insertions with 0.2, and deletions with 0.2. The homolog is created by applying the alignment string to a randomly selected substring of the native. The shortening of the sequences through insertions and deletions is padded by additional random characters.

[Figure 2: Left: Train and test error rates for the 3- and the 403-parameter model depending on the number of training examples. Right: Typical learned substitution matrix after 40 training examples for the 403-parameter model.]

Figure 2 shows training and test error rates for two models depending on the number of training examples, averaged over 10 trials. The first model has only 3 parameters ("match", "substitute", "insert/delete") and uses a uniform substitution matrix. The second model also learns the 400 parameters of the substitution matrix, resulting in a total of 403 parameters. We chose $C = 0.01$ and $\epsilon = 0.1$. The left-hand graph of Figure 2 shows that for the 403-parameter model, the generalization error is high for small numbers of training examples, but quickly drops as the number of examples increases. The 3-parameter model cannot fit the data as well. Its training error starts out much higher, and training and test error essentially converge after only a few examples.

The right-hand graph of Figure 2 shows the learned matrix of substitution costs for the 403-parameter model. As desired, the elements of the matrix are close to zero, except for the off-diagonal. This captures the substitution model $c \rightarrow (c \bmod 20) + 1$.

Figure 3 analyzes the efficiency of the algorithm via the number of constraints that are added to $K$ before convergence.
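Here is a sketch of the synthetic generator described above; the helper names are illustrative, while the letter distribution $P(c) = c/210$, sequence length 50, 10 decoys per native, alignment-string length 30, and operation probabilities are taken from the text. Padding or trimming the homolog to a fixed length is a simplification assumed for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = np.arange(1, 21)
LETTER_P = ALPHABET / ALPHABET.sum()          # P(c) = c / 210

def random_seq(length=50):
    return list(rng.choice(ALPHABET, size=length, p=LETTER_P))

def make_homolog(native, align_len=30):
    ops = rng.choice(["m", "s", "i", "d"], size=align_len, p=[0.2, 0.4, 0.2, 0.2])
    start = rng.integers(0, len(native) - align_len + 1)  # random substring
    src, hom = start, []
    for op in ops:
        if op == "m":                         # match: copy the native letter
            hom.append(native[src]); src += 1
        elif op == "s":                       # substitute c -> (c mod 20) + 1
            hom.append(native[src] % 20 + 1); src += 1
        elif op == "i":                       # insert a random letter
            hom.append(rng.choice(ALPHABET, p=LETTER_P))
        else:                                 # delete: consume a native letter
            src += 1
    while len(hom) < len(native):             # pad with additional random characters
        hom.append(rng.choice(ALPHABET, p=LETTER_P))
    return hom[:len(native)]

native = random_seq()
homolog = make_homolog(native)
decoys = [random_seq() for _ in range(10)]
```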

[Figure 3: Number of constraints added to K depending on the number of training examples (left) and the value of $\epsilon$ (right). If not stated otherwise, $\epsilon = 0.1$, $C = 0.01$, and $n = 20$.]

The left-hand graph of Figure 3 shows the scaling with the number of training examples. As predicted by Theorem 2, the number of constraints grows (sub-)linearly with the number of examples. Furthermore, the actual number of constraints is small enough that it can easily be handled by standard quadratic optimization software. The right-hand graph shows how the number of constraints in the final $K$ changes with $\log(\epsilon)$. The observed scaling appears to be better than suggested by the upper bound in Theorem 2. A good value for $\epsilon$ is 0.1. We observed that larger values lead to worse prediction accuracy, while smaller values decrease efficiency without providing further benefit.

7 Conclusions

The paper presented a discriminative learning approach to inferring the cost parameters of a linear sequence alignment model from training data. We proposed an algorithm for solving the resulting training problem and showed its efficiency both theoretically and empirically. Experiments show that the algorithm can effectively learn the alignment parameters on a synthetic task. We are currently applying the algorithm to learning alignment models for protein homology detection. An open question is whether it is possible to remove the assumption that the alignment between native and homolog is known, while maintaining the tractability of the problem.

Acknowledgments

Many thanks to Ron Elber and Tamara Galor for their insightful discussions and their contribution to the research that led to this paper.

References

[1] R. Barzilay and L. Lee. Bootstrapping lexical choice via multiple-sequence alignment. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.
[2] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.
[4] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
[5] D. Gusfield and P. Stelling. Parametric and inverse-parametric sequence alignment with XPARAL. Methods in Enzymology, 266:481-494, 1996.
[6] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. MIT Press, Cambridge, MA, 2000.
[7] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.
[8] J. Meller and R. Elber. Linear programming optimization and a double statistical filter for protein threading protocols. Proteins: Structure, Function, and Genetics, 45:241-261, 2001.
[9] E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522-532, 1998.
[10] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[11] F. Sun, D. Fernandez-Baca, and W. Yu. Inverse parametric sequence alignment. In International Computing and Combinatorics Conference (COCOON), 2002.
[12] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
