Efficient Splice Site Prediction with Context-Sensitive Distance Kernels

Bernard Manderick, Feng Liu and Bram Vanschoenwinkel

This paper presents a comparison between different context-sensitive kernel functions for doing splice site prediction with a support vector machine. Four types of kernel functions will be used: linear, polynomial, radial basis function and negative distance-based kernels. Domain knowledge can be incorporated into the kernels by incorporating statistical measures or by directly plugging in distance functions defined on the splice site instances. From the experimental results it becomes clear that the radial basis function-based kernels get the best accuracies. However, because classification speed is of crucial importance to the splice site prediction system, this kernel is computationally too expensive. Nevertheless, in general incorporating domain knowledge not only improves classification accuracy, but also reduces model complexity, which in turn increases classification speed.

1. Introduction

An important task in bioinformatics is the analysis of genome sequences for the location and structure of their genes, often referred to as gene finding. In general, a gene can be defined as a region in a DNA sequence that is used in the production of a specific protein. In many genes, the DNA sequence coding for proteins, called exons, may be interrupted by stretches of non-coding DNA, called introns. A gene starts with an exon, is then interrupted by an intron, followed by another exon, intron and so on, until it ends in an exon. Splicing is the process by which the introns are removed and the exons are joined. Hence we can make a distinction between two different splice sites: i) the exon-intron boundary, referred to as the donor site and ii) the intron-exon boundary, referred to as the acceptor site. Splice site prediction, an important subtask in gene finding systems, is the automatic identification of those regions in the DNA sequence that are either donor sites or acceptor sites [?].

Because splice site prediction instances can be represented by a context of a number of nucleotides before and after the candidate splice site, it is called a context-dependent classification task. In this paper we do splice site prediction with support vector machines (SVMs) using kernel functions that take into account the information available at different positions in the contexts. In this sense the kernel functions are called context-sensitive. This is explained in Sections ?? and ??.

Advances in Systems Modelling and ICT Applications

More precisely, in a support vector machine the data is first mapped non-linearly from the original input space $X$ to a high-dimensional Hilbert space called the feature space $F$ and then separated by a maximum-margin hyperplane, i.e. linearly, in that space $F$. By making use of the kernel trick, the mapping $\phi: X \to F$ remains implicit, and as a result we avoid working in the high-dimensional feature space $F$. Moreover, because the mapping is non-linear, the decision boundary which is linear in $F$ corresponds to a non-linear decision boundary in the input space $X$.

One of the most important design decisions in SVM learning is the choice of kernel function $K: X \times X \to \mathbb{R}$, because the maximum-margin hyperplane is defined completely by inner products of vectors in the Hilbert feature space $F$ using the kernel function $K$. The kernel $K$ takes elements $x$ and $y$ from the input space $X$ and calculates the inner product of $\phi(x)$ and $\phi(y)$ in the Hilbert feature space $F$ without having to represent or even to know the exact form of the elements $\phi(x)$ and $\phi(y)$. As a consequence the mapping remains implicit and we have a computational benefit [?]. In the light of the above it is not hard to see that the computational efficiency of $K$ is crucial for the success of the classification process. We refer to Section ?? for more theoretical background on the SVM.

As a result, the learning process can benefit a lot from the use of special purpose similarity or dissimilarity measures in the calculation of $K$ [?,?,?,?]. However, incorporating such knowledge in a kernel function is non-trivial since a kernel function $K$ has to satisfy a number of properties that result directly from the definition of an inner product. In this paper we will consider two types of kernel functions that can make direct use of distance functions defined on contexts themselves: i) negative distance kernels and ii) radial basis function kernels. This is explained in Section ??.
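
The kernel trick described above can be illustrated on a small numerical case. The sketch below is not taken from the paper: it assumes a homogeneous degree-2 polynomial kernel on 2-dimensional inputs, for which the feature map is known in closed form, and the names `poly2_kernel`, `phi` and `inner` are purely illustrative. It checks that the kernel evaluated in the input space equals the inner product computed in the feature space.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = <x, y>^2 (input space)."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs matching poly2_kernel (assumed example)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def inner(u, v):
    """Plain inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, y)       # never builds phi(x) or phi(y)
k_explicit = inner(phi(x), phi(y))    # same value, computed in feature space
```

Both routes give the same number, but only the second one has to build the feature vectors, which is what becomes infeasible when the feature space is very high-dimensional.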
Furthermore, because classification speed is of crucial importance to a splice site prediction system, the kernel functions used should be computationally very efficient. For that reason, in related work on splice site prediction with an SVM a linear kernel is chosen in favor of computational efficiency but at the cost of some accuracy [?]. In this light most of the kernels presented here will probably be too expensive. Therefore we also show results for context-sensitive linear kernels, and from these results it can be seen that the classification speed can be further increased while at the same time precision and accuracy of the predictions are a little higher. This is discussed in Section ??.

2. Context-Dependent Classification

In this paper we consider classification tasks where the purpose is to classify a focus symbol in a sequence of symbols, based on a number of symbols before and after the focus symbol. The focus symbol, together with the symbols before and after it, is called a context, and applications that rely on such contexts will be called context-dependent. Splice site prediction is a typical example of a context-dependent classification task. Here, each symbol is one of the four nucleotides {A, C, G, T}.

Key Note 1: Efficient Splice Site Prediction with Context-Sensitive Distance Kernels

2.1 Contexts

We start with a definition of a context followed by an illustration in the framework of splice site prediction.

Definition 1. A context $s_p^q$ is a sequence of symbols $s_i \in D$ with $p$ symbols before and $q$ symbols after a focus symbol $s_p$ at position $p$ as follows:

$$s_p^q = (s_0, \ldots, s_p, \ldots, s_{p+q}) \tag{1}$$

with $(p + q) + 1$ the length of the context, with $D$ the dictionary of all symbols, with $|D| = m$, with $p$ the left context size and with $q$ the right context size.

Example 1. Recall from the introduction that in splice site prediction the purpose is to automatically identify those regions in a DNA sequence that are donor sites or acceptor sites. Essentially, DNA is a sequence of nucleotides represented by a four-character alphabet or dictionary $D = \{A, C, G, T\}$. Moreover, an acceptor site always contains the AG dinucleotide and a donor site always contains the GT dinucleotide. In this light splice site prediction instances can be represented by a context of a number of nucleotides before and after the AG/GT dinucleotides. More precisely, given a fragment of a DNA sequence

... CCATTGGTGGCAGCCAG ...

the candidate donor site given by the dinucleotide GT can be represented by a context in terms of Definition ?? as

$$s_p^q = (\underbrace{A, T, T, G}_{s_0, \ldots}, \underbrace{GT}_{s_p}, \underbrace{G, G, C}_{s_{p+1}, \ldots, s_{p+q}})$$

with $p = 4$ the left context size and $q = 3$ the right context size and with $(p + q) + 1 = 8$ the total length of the context.

Furthermore, for splice site prediction there is no need to represent the AG/GT dinucleotides, because two separate classifiers are trained, one for donor sites and one for acceptor sites. In this light the only possible symbols occurring in the contexts are given by the dictionary $D$. Note that, for reasons of computational efficiency, in practice the symbols in the contexts will be represented by an integer, more precisely by assigning all the symbols that occur in the training set a unique index and subsequently using that index in the context instead of the symbols themselves.
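
The context construction of Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `extract_contexts` and `encode` are hypothetical, the dinucleotide itself is left out of the context (as the text says it need not be represented), and the symbol index is fixed by hand here rather than derived from a training set.

```python
def extract_contexts(dna, dinucleotide, p, q):
    """Yield (position, context) for every candidate site in dna.

    The context holds the p symbols before and the q symbols after the
    dinucleotide; candidates too close to either end are skipped.
    """
    k = len(dinucleotide)
    start = dna.find(dinucleotide)
    while start != -1:
        left, right = start - p, start + k + q
        if left >= 0 and right <= len(dna):
            yield start, dna[left:start] + dna[start + k:right]
        start = dna.find(dinucleotide, start + 1)

# Assumed fixed index per symbol; the paper assigns indices from the training set.
SYMBOL_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(context):
    """Replace every symbol by its integer index."""
    return [SYMBOL_INDEX[symbol] for symbol in context]

fragment = "CCATTGGTGGCAGCCAG"
candidates = list(extract_contexts(fragment, "GT", p=4, q=3))
encoded = [encode(context) for _, context in candidates]
```

For the fragment of Example 1 the only GT candidate yields the left context A, T, T, G and the right context G, G, C.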
2.2 The Overlap Metric

The most basic distance function defined on contexts is called the overlap metric. It simply counts the number of mismatching symbols at corresponding positions in two contexts.

Definition 2. Let $X$ be a set with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts, with symbols $s_i, t_i \in D$ the dictionary of all distinct symbols with $|D| = m$, and let $w$ be a context weight vector. Then the overlap metric $d_{OM}: X \times X \to \mathbb{R}^+$ is defined as

$$d_{OM}(s, t) = \sum_{i=0}^{n-1} \delta(s_i, t_i) \tag{2}$$

with $\delta: D \times D \to \mathbb{R}^+$ defined as

$$\delta(s_i, t_i) = \begin{cases} w_i & \text{if } s_i \neq t_i \\ 0 & \text{else} \end{cases} \tag{3}$$

with $w_i \geq 0$ a context weight for the symbol at position $i$.

Next, we make a distinction between two cases: i) if all $w_i = 1$ no weighting takes place and the metric is referred to as the simple overlap metric $d_{SOM}$ and ii) otherwise a position-dependent weighting does take place and the metric is referred to as the weighted overlap metric $d_{WOM}$. A question that now naturally arises is: what measures can be used to weigh the different context positions? Information theory provides many useful tools for measuring statistics in the way described above. In this work we made use of three measures known as i) information gain [?], ii) gain ratio [?] and iii) shared variance [?]. For more details the reader is referred to the related literature.

2.3 The Modified Value Difference Metric

The Modified Value Difference Metric (MVDM) [?] is a powerful method for measuring the distance between sequences of symbols like the contexts considered here. The MVDM is based on the Stanfill-Waltz Value Difference Metric introduced in 1986 [?]. The MVDM determines the similarity of all the possible symbols at a particular context position by looking at the co-occurrence of the symbols with the target class. Consider the following definition.

Definition 3. Let $X$ be a set with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts as before, with components $s_i$ and $t_i \in D$ the dictionary of all distinct symbols with $|D| = m$. Then the modified value difference metric $d_{MVDM}: X \times X \to \mathbb{R}^+$ is defined as

$$d_{MVDM}(s, t) = \sum_{i=0}^{n-1} \delta(s_i, t_i)^r \tag{4}$$

with $r$ a constant often equal to 1 or 2 and with $\delta: D \times D \to \mathbb{R}^+$ the difference of the conditional distribution of the classes as follows:

$$\delta(s_i, t_i)^r = \sum_{j=1}^{M} \left| P(y_j \mid s_i) - P(y_j \mid t_i) \right|^r \tag{5}$$

with $y_j$ the class labels and with $M$ the number of classes in the classification problem under consideration.

3. Context-Sensitive Kernel Functions

In the following section we will introduce a number of kernel functions that make direct use of the distance functions $d_{SOM}$, $d_{WOM}$ and $d_{MVDM}$ defined in the previous section. In the case of $d_{WOM}$ and $d_{MVDM}$ the kernels are called context-sensitive as they take into account the amount of information that is present at different context positions as discussed above.

3.1 Theoretical Background

Recall that in the SVM framework classification is done by considering a kernel-induced feature mapping $\phi$ that maps the data from the input space $X$ to a high-dimensional Hilbert space $F$, and classification is done by means of a maximum-margin hyperplane in that space $F$. This is done by making use of a special function called a kernel.

Definition 4. A kernel $K: X \times X \to \mathbb{R}$ is a symmetric function so that for all $x$ and $x'$ in $X$, $K(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ is a (non-linear) mapping from the input space $X$ into the Hilbert space $F$ provided with the inner product $\langle \cdot, \cdot \rangle$.

However, not all symmetric functions over $X \times X$ are kernels that can be used in an SVM, because a kernel function needs to satisfy a number of conditions imposed by the fact that it calculates an inner product in $F$. More precisely, in the SVM framework we distinguish two classes of kernel functions: i) positive semidefinite (PSD) kernels and ii) conditionally positive definite (CPD) kernels. Whereas a PSD kernel can be considered as one of the simplest generalizations of one of the simplest similarity measures, i.e. the inner product, CPD kernels can be considered as generalizations of the simplest dissimilarity measure, i.e. the distance $\|x - x'\|$ [?,?,?]. One type of CPD kernel that is of particular interest to us is given in [?], from which we quote the following two theorems.

Theorem 1. Let $X$ be the input space, then the function $K_{nd}: X \times X \to \mathbb{R}$,

$$K_{nd}(x, x') = -\|x - x'\|^{\beta} \quad \text{with } 0 < \beta \leq 2 \tag{6}$$

is CPD. The kernel $K_{nd}$ defined in this way is referred to as the negative distance kernel.

Another result that is of particular interest to us relates a CPD kernel $K$ to a PSD kernel $K_{rbf}$ by plugging $K$ into the exponent of the standard radial basis function kernel. This is expressed in the following theorem [?]:

Theorem 2. Let $X$ be the input space and let $K: X \times X \to \mathbb{R}$ be a kernel, then $K$ is CPD if and only if

$$K_{rbf}(x, x') = \exp(\gamma K(x, x')) \tag{7}$$

is PSD for all $\gamma > 0$. The kernel $K_{rbf}$ defined in this way is referred to as the radial basis function kernel.

For Theorem ?? to work, it is assumed that the input space $X$ is a subset of a normed vector space. But for contexts in particular and sequences of symbols in general one cannot define a norm like in the RHS of Equation ??. More precisely, given the results above, if we want to use an arbitrary distance $d_X$ defined on the input space $X$ in a kernel $K$, we should be able to express it as $d_X(x, x') = \|x - x'\|$, from which it then automatically follows that $d_X$ is CPD by application of Theorem ??. In our case however, since the input space $X$ is the set of all contexts of length $n$, the distances $d_{SOM}$, $d_{WOM}$ and $d_{MVDM}$ we would like to use cannot be expressed in terms of Theorem ??. Nevertheless, in previous work it has been shown that $d_{SOM}$, $d_{WOM}$ and $d_{MVDM}$ are CPD [?,?,?,?]. This will be briefly explained next; for more details the reader is referred to the literature.

More precisely, for the overlap metric defined on the contexts it can be shown that it corresponds to an orthonormal vector encoding of those contexts [?,?,?]. In the orthonormal vector encoding every symbol in the dictionary $D$ is represented by a unique unit vector and complete contexts are formed by concatenating these unit vectors. Notice that this is actually the standard approach to context-dependent classification with SVMs [?,?], and in this light the non-sensitive linear, polynomial, radial basis function and negative distance kernels employing the simple overlap metric (i.e. the unweighted case) presented next are actually equivalent to the standard linear, polynomial, radial basis function and negative distance kernels applied to the orthonormal vector encoding of the contexts. Finally, for the MVDM with $r = 2$ it can be shown that it corresponds to the Euclidean distance in a transformed space, based on a probabilistic reformulation of the MVDM presented in [?,?].
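
The correspondence between the overlap metric and the orthonormal vector encoding can be checked numerically. The following sketch is an illustration, not the authors' code: the function names and the example weight vector are assumptions. It computes the simple and weighted overlap metric on two contexts and compares the simple overlap metric with the squared Euclidean distance between the concatenated unit vectors, where every mismatching position contributes exactly 2.

```python
def overlap_distance(s, t, w=None):
    """Overlap metric: sum of the weights w[i] at positions where s and t differ.

    With w=None every weight is 1, which gives the simple overlap metric.
    """
    if len(s) != len(t):
        raise ValueError("contexts must have the same length")
    if w is None:
        w = [1.0] * len(s)
    return sum(wi for si, ti, wi in zip(s, t, w) if si != ti)

def one_hot(context, alphabet="ACGT"):
    """Orthonormal encoding: one unit vector per symbol, concatenated."""
    return [1.0 if a == symbol else 0.0 for symbol in context for a in alphabet]

def squared_euclidean(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

s, t = "ATTGGGC", "ATCGGTC"            # the contexts differ at positions 2 and 5
weights = [0.1, 0.2, 0.5, 1.0, 1.0, 0.5, 0.2]   # hypothetical context weights

d_som = overlap_distance(s, t)
d_wom = overlap_distance(s, t, w=weights)
d_vec = squared_euclidean(one_hot(s), one_hot(t))
```

Working on the symbols directly avoids building the encoded vectors, which are four times longer than the contexts for the nucleotide alphabet.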
3.2 A Weighted Polynomial Kernel

The first kernel we will define here is based on Equation ?? of the definition of the overlap metric from Definition ??. In the same way as before, we make a distinction between the unweighted non-sensitive case and the weighted context-sensitive case; for more details the reader is referred to [?,?,?].

Definition 5. Let $X$ be the input space with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts and $s_i, t_i \in D$ the symbols at position $i$ in the contexts as before, and let $w$ be a context weight vector, then we define the simple overlap kernel $K_{SOK}: X \times X \to \mathbb{R}$ as

$$K_{SOK}(s, t) = \left( \sum_{i=0}^{n-1} \delta(s_i, t_i) + c \right)^{d} \tag{8}$$

with $c \geq 0$, $d > 0$ and $w = 1$. The weighted overlap kernel $K_{WOK}: X \times X \to \mathbb{R}$ is defined in the same way but with a context weight vector $w \neq 1$.

3.3 Negative Distance Kernels

Next, we give the definitions of three negative distance kernels employing the distances $d_{SOM}$, $d_{WOM}$ and $d_{MVDM}$; for more details we refer to [?,?,?]. We start with the definition of two negative distance kernels using the overlap metric from Definition ??. In the same way as before, we make a distinction between the unweighted, non-sensitive case $d_{SOM}$ and the weighted, context-sensitive case $d_{WOM}$.

Definition 6. Let $X$ be the input space with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts and $s_i, t_i \in D$ the symbols at position $i$ in the contexts as before, and let $w$ be a context weight vector, then we define the negative overlap distance kernel $K_{NODK}: X \times X \to \mathbb{R}$ as

$$K_{NODK}(s, t) = -d_{SOM}(s, t)^{\beta} \tag{9}$$

with $0 < \beta \leq 2$ and $w = 1$ as before. The negative weighted distance kernel $K_{NWDK}: X \times X \to \mathbb{R}$ is defined in the same way but substituting $d_{WOM}$ for $d_{SOM}$ in the RHS of Equation ??, i.e. with a context weight vector $w \neq 1$.

Similarly, for the MVDM from Definition ?? we can define a negative distance type kernel as follows.

Definition 7. Let $X$ be the input space with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts and $s_i, t_i \in D$ the symbols at position $i$ in the contexts as before, then we define the negative modified distance kernel $K_{NMDK}: X \times X \to \mathbb{R}$ as

$$K_{NMDK}(s, t) = -d_{MVDM}(s, t)^{\beta/2} \tag{10}$$

with $0 < \beta \leq 2$ as before.

However, it should be noted that for $r = 1$ in the definition of the MVDM, $d_{MVDM}$ is not CPD and thus for $r = 1$ the kernel $K_{NMDK}$ will also not be CPD. Nevertheless, given the good empirical results we will use $K_{NMDK}$ with $d_{MVDM}$ and $r = 1$ anyway.

3.4 Radial Basis Function Kernels

Next, we give the definitions of three radial basis function kernels employing the distances $d_{SOM}$, $d_{WOM}$ and $d_{MVDM}$; for more details we refer to [?,?,?]. We start with the definition of two radial basis function kernels employing the overlap metric from Definition ??. In the same way as before, we make a distinction between the unweighted non-sensitive case $d_{SOM}$ and the weighted context-sensitive case $d_{WOM}$.

Definition 8. Let $X$ be the input space with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts and $s_i, t_i \in D$ the symbols at position $i$ in the contexts as before, and let $w$ be a context weight vector, then we define the overlap radial basis function kernel $K_{ORBF}: X \times X \to \mathbb{R}$ as

$$K_{ORBF}(s, t) = \exp(-\gamma \, d_{SOM}(s, t)) \tag{11}$$

with $\gamma > 0$ as before and with $w = 1$. The weighted radial basis function kernel $K_{WRBF}: X \times X \to \mathbb{R}$ is defined in the same way but substituting $d_{WOM}$ for $d_{SOM}$ in the RHS of Equation ??, i.e. with a context weight vector $w \neq 1$.

Similarly, for the MVDM from Definition ?? we can define a radial basis function type kernel as follows.

Definition 9. Let $X$ be the input space with contexts $s_p^q$ and $t_p^q$ with $n = (p + q) + 1$ the length of the contexts and $s_i, t_i \in D$ the symbols at position $i$ in the contexts as before, then we define the modified radial basis function kernel $K_{MRBF}: X \times X \to \mathbb{R}$ as

$$K_{MRBF}(s, t) = \exp(-\gamma \, d_{MVDM}(s, t)) \tag{12}$$

with $\gamma > 0$ as before.

It should however be noted that, with respect to the discussion above, for $r = 1$ the distance $d_{MVDM}$ and the corresponding kernel $K_{NMDK}$ are not CPD, and therefore for $r = 1$ the kernel $K_{MRBF}$ will not be PSD. Nevertheless, given the good empirical results we used it anyway.
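
To make the distance-based kernels of Sections 3.3 and 3.4 concrete, the sketch below estimates the class-conditional probabilities used by the MVDM from a toy training set and plugs the resulting distance into a negative distance kernel and a radial basis function kernel. Everything here is an assumption for illustration: the function names, the four toy contexts and their labels, and the choice to pass the exponent of the negative distance kernel explicitly rather than fixing it.

```python
import math
from collections import Counter, defaultdict

def class_conditionals(contexts, labels):
    """Estimate P(class | symbol) separately for every context position."""
    classes = sorted(set(labels))
    counts = [defaultdict(Counter) for _ in contexts[0]]
    for context, label in zip(contexts, labels):
        for i, symbol in enumerate(context):
            counts[i][symbol][label] += 1
    return [
        {sym: [c[y] / sum(c.values()) for y in classes] for sym, c in pos.items()}
        for pos in counts
    ]

def mvdm_distance(s, t, probs, r=2):
    """Sum over positions and classes of |P(y|s_i) - P(y|t_i)|^r."""
    return sum(
        abs(a - b) ** r
        for i, (si, ti) in enumerate(zip(s, t))
        for a, b in zip(probs[i][si], probs[i][ti])
    )

def negative_distance_kernel(distance, exponent=1.0):
    """Negative distance kernel value for an already computed distance."""
    return -(distance ** exponent)

def rbf_kernel(distance, gamma=0.5):
    """Radial basis function kernel value exp(-gamma * distance)."""
    return math.exp(-gamma * distance)

# Hypothetical toy data: label 1 = true site, label 0 = decoy.
train = ["AG", "AT", "CG", "CT"]
labels = [1, 1, 0, 0]
probs = class_conditionals(train, labels)

d = mvdm_distance("AG", "CT", probs, r=2)   # only position 0 separates the classes
k_nd = negative_distance_kernel(d)
k_rbf = rbf_kernel(d)
gram = [[rbf_kernel(mvdm_distance(a, b, probs)) for b in train] for a in train]
```

In this toy set the symbols at position 1 occur equally often with both classes, so they contribute nothing to the distance, which is the sense in which the MVDM is sensitive to the information present at each position. A symbol never seen at a position during training would raise a `KeyError` in this sketch; a real system would need a fallback for that case.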

4. Experiments

We have done a number of experiments. First of all, we wanted to validate the feasibility of our approach and compare our kernel functions that operate on contexts directly, to see whether they do at least as well as, and hopefully better than, the traditional kernels. In these experiments the left and right context length was set to 50. Second, we set up some experiments to find the optimal left and right context length for each classifier. Third, we also looked at di- and trinucleotides to find out whether this gave better performance than the single nucleotide case. In the next sections, we describe the software and the data sets used in our experiments, we discuss how we have set the different parameters for the SVM, we present and discuss the results obtained and finally we give an overview of related work.

4.1 Software and Data

We did the experiments with LIBSVM [?], a Java/C++ library for SVM learning. The data set we use in the experiments is a set of human genes, which is referred to as HumGS [?]. Each instance is represented by a fixed context size of 50 nucleotides before and 50 nucleotides after the candidate splice site, based on the initial design strategy in [?]. Since we train one classifier to predict donor sites and another classifier to predict acceptor sites, separate training and test sets are constructed for donor and acceptor sites. For the purpose of training the classifiers, we constructed balanced training sets. For testing however we want a reflection of the real situation and keep the same ratio as given in the original set HumGS. This is shown in Table ??.

Table 1. Overview of the data sets that have been used for the splice site prediction experiments.

data set            genes    GT+    GT-    AG+    AG-
HumGS training           /
HumGS testing            /

4.2 Parameter Selection and Accuracy

Parameter selection is done by 5-fold cross validation on the training set. For the ORBF, WRBF and MRBF, there are two free parameters that need to be optimized: the SVM cost parameter $C$ (which is a trade-off between the model complexity and the model accuracy) and the radial basis function parameter $\gamma$. We performed a fine grid search for values of $C$ and $\gamma$ between $2^{-16}$ and $2^{5}$. For the NODK, NWDK and NMDK only the cost parameter $C$ has to be optimized because we choose $\beta$ fixed to 1 as this gives very good results; more precisely, for $\beta = 2$ results are not good at all, and other values have not been tried. Again, values for $C$ between $2^{-16}$ and $2^{5}$ have been considered. The LINEAR kernel in the results below refers to the overlap kernel from Definition ?? with $d = 1$ and $c = 0$. For the SOK and the WOK we take $d = 2$ and $c = 0$, as previous work pointed out that higher values for $d$ actually lead to bad results, while taking values for $c > 0$ does not have a significant impact on the results. As a weighting scheme for the weighted kernels, we used three different weights: Information Gain (IG), Gain Ratio (GR) and Shared Variance (SV). For more details the reader is referred to [?].

Splice site prediction systems are often evaluated by means of the percentage of false positive (FP) classifications at a particular recall rate. This measure is referred to as FP% [?] and is calculated as follows:

$$FP\% = \frac{\#\,\text{false positives}}{\#\,\text{false positives} + \#\,\text{true negatives}} \times 100$$

We used this evaluation measure for a recall rate of 95%. In this case the measure is referred to as FP95%, i.e. the FP95% measure gives the percentage of the predictions falsely classified as actual splice site at a level where the system has found 95% of all actual splice sites in the test set. Note that the purpose is to have FP95% as low as possible.

4.3 Results

Table ?? gives an overview of the final FP95% results and model complexity in terms of the number of support vectors of the different kernels on the splice site prediction task. Note that the confidence intervals have been obtained by bootstrap resampling, at a confidence level $\alpha = 0.05$ [?]. An FP95% rate outside of these intervals is assumed to be significantly different from the related FP95% rate at this confidence level. In addition to the final FP95% results we also give as an illustration two FP% plots, for donor sites, comparing the context-sensitive kernels with those kernels that are not context-sensitive. Figure ?? does this for the negative distance kernel making use of the MVDM and Figure ?? does this for the radial basis function kernel making use of the WOM with GR, IG and SV weights.
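
The FP95% measure introduced in Section 4.2 can be computed from classifier scores as in the following sketch. It is one possible reading of the definition, under stated assumptions: a higher score means "more likely a splice site", the decision threshold is the highest one at which the requested recall is reached, and the function name `fp_rate_at_recall` and the toy scores are hypothetical.

```python
import math

def fp_rate_at_recall(scores, labels, recall=0.95):
    """Return 100 * FP / (FP + TN) at the highest threshold reaching `recall`.

    scores: decision values, higher = more likely a true site (assumption).
    labels: 1 for true sites, anything else for non-sites.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")
    needed = math.ceil(recall * len(positives))   # true positives required
    threshold = positives[needed - 1]             # score of the last one needed
    false_positives = sum(1 for s in negatives if s >= threshold)
    return 100.0 * false_positives / len(negatives)

# Hypothetical decision values for four true sites and four decoys.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

fp95 = fp_rate_at_recall(scores, labels, recall=0.95)
fp50 = fp_rate_at_recall(scores, labels, recall=0.5)
```

With only four positives, a recall of 95% forces the threshold down to the lowest-scoring true site, so one of the four decoys is accepted as well.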

Table 2. Splice site prediction, results for all kernels, for donor sites and for acceptor sites.

                        donor sites            acceptor sites
Kernel and weights      FP95%       #SVs       FP95%       #SVs
LINEAR                  8.18 ±                 ±
LINEAR/GR               7.86 ±                 ±
LINEAR/IG               7.92 ±                 ±
LINEAR/SV               7.88 ±                 ±
SOK                     7.19 ±                 ±
WOK/GR                  6.51 ±                 ±
WOK/IG                  6.38 ±                 ±
WOK/SV                  6.43 ±                 ±
NODK                    7.97 ±                 ±
NWDK/GR                 6.43 ±                 ±
NWDK/IG                 6.40 ±                 ±
NWDK/SV                 6.38 ±                 ±
NMDK (r = 1)            6.26 ±                 ±
NMDK (r = 2)            6.38 ±                 ±
ORBF                    7.46 ±                 ±
WRBF/GR                 6.25 ±                 ±
WRBF/IG                 6.21 ±                 ±
WRBF/SV                 6.27 ±                 ±
MRBF (r = 1)            5.81 ±                 ±
MRBF (r = 2)            6.40 ±                 ±

From the results it can easily be seen that the context-sensitive kernels making use of the WOM with IG, GR and SV weights and of the MVDM outperform their simple non-sensitive counterparts both in accuracy and in model complexity, and in almost all cases the difference is significant. There is one exception: the MRBF with r = 2 for acceptor sites performs worse than its non-sensitive counterpart. Another overall observation is that the difference in the results between different context weights is not significant at all. Finally, it can be seen that the best result for donor sites is obtained by the MRBF with r = 1, but for acceptor sites this is the WRBF with IG weights. It is therefore clear that the success of the metric and the weights used depends to a great extent on the properties of the data under consideration, so that it is worthwhile trying different metrics and different context weights to see which one gives the best result.

Finally, if one would like to use the LINEAR kernel in favour of classification speed but at the cost of some accuracy, it can be seen from the results that the weighted LINEAR kernel outperforms its unweighted counterpart, although the difference is not significant at the confidence level used. Nevertheless, the number of support vectors is significantly lower than for the unweighted LINEAR kernel and this will result in faster classification, because classification of a new instance happens by comparing it with every support vector in the model through the kernel function K.

Next, we look at the experiments to find the optimal left and right context length for each classifier. Then, we look at di- and trinucleotides to find out whether this gave better performance than the single nucleotide case. For these experiments we used the WRBF/GR kernel. This choice was based on the fact that WRBF performs second best for donor sites and best for acceptor sites. Moreover, since IG (information gain), GR (gain ratio) and SV (shared variance) were not significantly different in our experiments, we used GR as the weighting scheme. This follows from the experiments described above. The results are shown in Table ??.

Table 3. Splice site prediction, for the WRBF/GR kernel, for donor sites and for acceptor sites.

                      donor sites                               acceptor sites
nr. nucleotides       FP95%     left context   right context    FP95%     left context   right context
Single nucleotide     6.54 ±                                    ±
Dinucleotide          5.52 ±                                    ±
Trinucleotide         5.87 ±                                    ±

4.4 Related Work

The number of papers on splice site prediction and the related problem of gene finding is enormous and hence it is impossible to give an exhaustive overview. We will give some popular references (according to CiteSeer) and discuss some recent work. The problem of recognizing signals in genomic sequences by computer analysis was pioneered by Staden [?] and the recognition of splice sites using neural networks was first addressed by Brunak et al. [?]. They trained a backpropagation feedforward neural network with one layer of hidden units to recognize donor and acceptor sites, respectively. The input consists of a sliding window centered on the nucleotide for which a prediction is to be made. The window is encoded as a numerical vector. The best results were obtained by combining a neural network to recognize the consensus signal at the splice site with another one that predicted coding regions based on the statistical properties of the codon usage and preference. This tool is available online.

Kulp et al. [?] and Reese et al. [?] build upon the work of Brunak by explicitly taking into account the correlations between neighboring nucleotides around a splice site by using dinucleotides as input features instead of single nucleotides. This tool is available online. GeneSplicer [?] uses a combination of a hidden Markov model and a decision tree. It obtained good performance compared to other leading splice site detectors at that time. Rätsch and Sonnenburg [?] use an SVM with a special kernel to classify nucleotides as either donor or acceptor sites. There is one SVM for donor sites and one for acceptor sites. The system predicts the correct splice form for more than 92% of these genes. This approach is quite similar to ours but the kernel is different. Finally, a list of online tools for splice site prediction and gene finding is available online.

5. Conclusions

In this article it was shown how different statistical measures and distance functions can be included into kernel functions for SVM learning in context-dependent classification tasks. The purpose of this approach is to make the kernels sensitive to the amount of information that is present in the contexts. More precisely, the case of splice site prediction has been discussed and from the experimental results it became clear that the sensitivity information has a positive effect on the results. So far, this was shown on only one data set because the SVM is computationally very expensive, but we have shown that kernel functions that operate on contexts directly give additional benefits. At the moment, we are running experiments on a number of other data sets to show that the increased performance is not due to bias to the data sets. Apart from that, we are running experiments with more complex features based on the improved design strategy in [?], where an FP95% rate of 2.2% for donor and 2.9% for acceptor sites is obtained. In this light it remains to be seen whether the positive effect of the sensitivity information will still be significant in a system that already performs at very high precision without such information. Finally, we plan to compare our results with the ones obtained by other classifiers on the same data sets.
