Predictive Vaccinology: Optimisation of Predictions Using Support Vector Machine Classifiers
Ivana Bozic 1,2, Guang Lan Zhang 2,3, and Vladimir Brusic 2,4

1 Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
2 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
{guanglan,vladimir}@i2r.a-star.edu.sg
3 School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore
4 School of Land and Food Sciences and the Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia

Abstract. Promiscuous human leukocyte antigen (HLA) binding peptides are ideal targets for vaccine development. Existing computational models for prediction of promiscuous peptides used hidden Markov models and artificial neural networks as prediction algorithms. We report a system based on support vector machines that outperforms previously published methods. Preliminary testing showed that it can predict peptides binding to HLA-A2 and -A3 supertype molecules with excellent accuracy, even for molecules where no binding data are currently available.

1 Introduction

Computational predictions of peptides (short sequences of amino acids) that bind human leukocyte antigen (HLA) molecules of the immune system are essential for designing vaccines and immunotherapies against cancer, infectious disease and autoimmunity [1]. More than 1800 different HLA molecules have been characterised to date, and many of them have unique peptide binding preferences. An HLA supertype is a group of HLA molecules that share similar molecular structure and also have similar binding preferences: they bind largely overlapping sets of peptides [2]. Some dozen class I HLA supertypes have been described, of which four major supertypes (HLA-A2, -A3, -B7, and -B44) are present in approximately 90% of the human population [2]. Prediction of peptide binding to a single HLA molecule is less relevant for the design of vaccines applicable to a large proportion of the human population.
However, predictions of promiscuous peptides, i.e. those that bind multiple HLA molecules within a supertype, are important in the development of vaccines that are relevant to a broader population. Identification of peptides that bind HLA molecules is a combinatorial problem. High accuracy of such predictions is therefore of great importance, because it makes identification of vaccine targets more cost- and time-effective. A large number of prediction methods have been used for identification of HLA binding peptides. They include [reviewed in 3] binding motifs, quantitative matrices, decision trees, artificial neural networks, hidden Markov models, and molecular modeling. In addition, it was reported that models using support vector machines (SVM) perform better than other prediction methods when applied to a single HLA molecule [4,5].

M. Gallagher, J. Hogan, and F. Maire (Eds.): IDEAL 2005, LNCS 3578, Springer-Verlag Berlin Heidelberg 2005

Predictions of peptide binding to multiple HLA molecules of a supertype were
performed using hidden Markov models (HMM) [6] and artificial neural networks (ANN) [7]. These reports also showed that accurate predictions of promiscuous peptides within a supertype can be performed for HLA variants for which no experimental data are available. The accuracy of predictions of peptide binding to multiple HLA variants within the A2 and A3 supertypes was measured by the area under the receiver operating characteristic curve (AROC) [8]. The reported values were 0.85 < AROC < 0.90 for A2.

We developed SVM models for prediction of peptide binding to molecules of the HLA-A2 and -A3 supertypes and compared their prediction power to the corresponding ANN and HMM models. SVM models use a machine learning technique that implements the structural risk minimization principle, which minimizes the upper bound on the expected generalization error, thus giving high generalization ability [9]. SVMs exhibit great resistance to over-fitting. In addition, training of SVMs is a convex programming problem: it always finds a global solution, in contrast to training of ANNs, where many local minima usually exist.

2 Data Preprocessing

Nine amino acids long (9-mer) peptide data were collected from the MHCPEP database [10], published articles, and a set of HLA non-binding peptides (V. Brusic, unpublished data). Our datasets contain peptides with experimentally verified binding affinities (binder or non-binder) for fifteen variants of the HLA-A2 and eight variants of the HLA-A3 supertype. Data used for training represent the interaction (contact amino acids) between a 9-mer peptide and an HLA molecule. A peptide/HLA interaction is represented as a virtual peptide comprising nine peptide residues plus the HLA residues that come in contact with the peptide. Only those contact residues that vary across the molecules belonging to one supertype were considered for the analysis.
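As a minimal sketch of the virtual-peptide construction described above (ours, not the authors' code): the 9-mer is concatenated with the HLA contact residues that vary within the supertype. The function name, the example sequences and the contact positions below are invented for illustration only.

```python
def make_virtual_peptide(peptide, hla_sequence, contact_positions):
    """Build a virtual peptide: the 9-mer followed by the non-conserved
    HLA contact residues (0-based positions into the HLA protein sequence).
    The positions used here are illustrative, not the paper's actual list."""
    assert len(peptide) == 9, "virtual peptides are built from 9-mers"
    contacts = "".join(hla_sequence[p] for p in contact_positions)
    return peptide + contacts

# Hypothetical 9-mer, HLA sequence fragment, and contact positions:
v = make_virtual_peptide("ALAKAAAAV", "MAVMAPRTLVLLLSG", [0, 2, 5])
```

The same contact-position list is applied to every molecule of the supertype, which is what lets one model cover molecules lacking binding data.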
There are 19 non-conserved residues across A2 supertype molecules and 14 non-conserved contact residues across A3 supertype molecules. The representation of each non-conserved residue is longer, since it is a binary vector whose length equals the number of different residues observed at that contact position, and only one position can take the value 1. A virtual peptide for the A2 supertype comprises 58 amino acids, and for A3, 62 amino acids. A detailed explanation of the data representation is given in [6,7,11]. If A is the set of 20 amino acids, a virtual peptide is defined as (a_1, a_2, ..., a_l), a_j ∈ A, j ∈ {1, ..., l}, with l = 58 for the A2 and l = 62 for the A3 supertype. The HLA protein sequences used in the processing of virtual peptides were extracted from the HLA Sequence Data release of the IMGT/HLA Sequence Database. This representation enables binding peptide predictions for different molecules of the entire supertype.

3 Description of the Method

3.1 Support Vector Machines

Let X be a training set of virtual peptides and Y the set of their experimentally determined binding affinities. Each member of X can have one of two possible binding affinities (1 if it is a binder, -1 if it is not); X = {x_1, ..., x_m}, x_j ∈ R^n, and Y =
{y_1, ..., y_m}, y_j ∈ {1, -1}, j ∈ {1, ..., m}. We seek a classification function f : R^n → R with the property f(x_j) > 0 ⇔ y_j = 1 and f(x_j) < 0 ⇔ y_j = -1, which accurately predicts the classes of unseen data points. The value

ρ := min_j y_j f(x_j)    (1)

is called the margin and stands for the worst classification over the whole training set. Training examples that lie right on the margin are called support vectors. If the training data are linearly separable, we seek a linear function f(x) that has the maximum margin. This is equivalent to constructing a maximum-margin hyperplane that separates the two classes (binders and non-binders) in the n-dimensional space of our training data. In the case of linearly non-separable data, the idea is to map the training data into a higher-dimensional feature space F (a Hilbert space) via a non-linear map Φ : R^n → F, dim F > n, and construct a separating maximum-margin hyperplane there. This mapping is simplified by introducing a kernel function k:

k(x, x') = ⟨Φ(x), Φ(x')⟩.    (2)

Because real-life data are usually non-separable (even in the feature space) due to noise, we allow some misclassified training examples. This is enabled by the introduction of a new parameter C, which is a trade-off between the margin and the training error. Constructing a maximum-margin hyperplane is a constrained convex optimization problem. Thus, solving the SVM problem is equivalent to finding a solution (see [12]) to the Karush-Kuhn-Tucker (KKT) conditions and, equivalently, to the Wolfe dual problem: find α which maximizes W(α),

W(α) = Σ_{j=1..m} α_j - (1/2) Σ_{j,s=1..m} α_j α_s y_j y_s k(x_j, x_s),
subject to 0 ≤ α_j ≤ C, j ∈ {1, ..., m}, and Σ_{j=1..m} α_j y_j = 0.    (3)

This is a linearly constrained convex quadratic program, which is solved numerically. The decision function is

f(x) = Σ_{j=1..m} y_j α_j k(x, x_j) + b.    (4)

The α_j are the solutions of the corresponding quadratic program, and b is easily found using the KKT complementarity condition [12]. Commonly used kernels include the linear, Gaussian and polynomial kernels.
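The three kernels and the decision function (4) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' SVMlight implementation; the function names and the standard kernel forms used here are our choices.

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = <x, z>
    return float(np.dot(x, z))

def polynomial_kernel(x, z, d=2):
    # One standard polynomial form: k(x, z) = (<x, z> + 1)^d
    return float((np.dot(x, z) + 1.0) ** d)

def gaussian_kernel(x, z, sigma=0.1):
    # k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def decision_function(x, support_vectors, labels, alphas, b, kernel):
    # f(x) = sum_j y_j * alpha_j * k(x, x_j) + b   -- equation (4)
    return sum(y * a * kernel(x, sv)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b
```

Given the α_j and b returned by the quadratic-program solver, `decision_function` classifies a new virtual peptide by the sign of f(x).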
3.2 Implementation

We used the SVMlight package with a quadratic programming tool for solving the small intermediate quadratic programming problems, based on the method of Hildreth and D'Espo [13]. Prior to training, every peptide from our dataset was transformed into a
virtual peptide. These data were then transformed into a format compatible with the package used, and every virtual peptide (a_1, a_2, ..., a_l) was translated into a set of n = 20·l indicators (bits). Each amino acid of a virtual peptide is represented by 20 indicators. In the complete sequence of indicators, l indicators are set to 1, representing the specific residues present at each position, and the remaining 19·l indicators are set to 0. In our case, x_j = (i_1, i_2, ..., i_n), i_s ∈ {0, 1}, s ∈ {1, ..., n}, j ∈ {1, ..., m}, represents a virtual peptide. The values y_j = ±1 indicate whether a peptide binds (1) or does not bind (-1) to an HLA molecule.

Blind testing was performed to assess the performance of SVM for prediction of promiscuous peptides. To test the predictive accuracy of peptide binding to each of the HLA-A2 and -A3 variants, we used all peptides (binders and non-binders) related to that variant as the testing data, and all peptides related to the other variants from the same supertype as training data. For example, the training set of HLA-A*0201 contained all peptides related to all HLA-A2 molecules except for HLA-A*0201. Testing of peptide binding to each HLA variant was thus performed without inclusion of experimental data for that particular variant in the training set. The testing results are therefore likely to underestimate the actual performance, since the final model contains the available data for all HLA variants. We performed blind testing on five HLA-A2 and seven HLA-A3 molecules for which sufficient data were available for valid testing (Table 1). The other variants of HLA-A2 and -A3 molecules were excluded from testing, since there were insufficient data for generating adequate test sets. Throughout blind testing, we examined three kernels (linear, Gaussian and polynomial) and various combinations of the SVM parameters (trade-off c, σ for the Gaussian and d for the polynomial kernel).
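The indicator encoding and the AROC measure used to score blind testing can be sketched as follows. This is our illustration, not the paper's code: the residue alphabet ordering and function names are assumptions, and AROC is computed here via the equivalent rank-sum (Mann-Whitney) statistic.

```python
import numpy as np

# 20-letter amino acid alphabet; the ordering is our choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_virtual_peptide(virtual_peptide):
    """Translate a virtual peptide of length l into n = 20*l binary
    indicators: each position contributes 20 bits, exactly one of which
    is 1, so l bits are set and the remaining 19*l bits are 0."""
    x = np.zeros(20 * len(virtual_peptide), dtype=np.int8)
    for pos, residue in enumerate(virtual_peptide):
        x[20 * pos + AMINO_ACIDS.index(residue)] = 1
    return x

def a_roc(labels, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    binder (+1) is scored above a randomly chosen non-binder (-1);
    ties count one half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    p = scores[labels == 1]   # binder scores
    n = scores[labels == -1]  # non-binder scores
    wins = (p[:, None] > n[None, :]).sum() + 0.5 * (p[:, None] == n[None, :]).sum()
    return wins / (len(p) * len(n))
```

In a blind-testing run, all peptides of the held-out molecule would be encoded with `encode_virtual_peptide`, scored by an SVM trained on the remaining molecules, and summarised by `a_roc`.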
We trained 150 different SVM models for each supertype, with c varying from 0.01 to 20, d from 1 to 10 and σ from 0.001 to 1. The models with the best prediction performance (highest average AROC) were the Gaussian kernel with σ = 0.1, c = 0.5 for HLA-A2 and the Gaussian kernel with σ = 0.1 and c = 2 for HLA-A3.

Table 1. Blind testing: number of peptides in training and test sets for the A2 and A3 supertypes. Each training data set contained the peptides related to the other *14 HLA-A2 or **7 HLA-A3 molecule variants. Columns: molecule; binders and non-binders in the training data; binders and non-binders in the test data. [Table entries not recoverable from this transcription.]

4 Experimental Results

AROC values of the SVM models that showed the best prediction performance are shown in Table 2, along with the corresponding values for the optimized HMM [6] and ANN [7] predictions. The average SVM prediction accuracy is excellent for both A2 (AROC = 0.89) and A3 supertype predictions (AROC = 0.92). SVMs performed marginally
better than the HMM and ANN prediction methods on A2 supertype molecules, and significantly better on A3 supertype molecules (p < 0.05, Student's t-test).

Table 2. AROC values for blind testing of the A2 and A3 models; comparison with HMM and ANN. Columns: molecule; SVM, HMM and ANN values for HLA-A2; molecule; SVM, ANN and HMM values for HLA-A3; with averages and standard deviations. [Table entries not recoverable from this transcription.]

A comparison of the predictive performance of all three kernels is shown in Table 3. For both the A2 and A3 supertypes, the prediction models with the Gaussian kernel demonstrated the best overall performance. AROC values for models with the polynomial kernel are very close to those of the Gaussian kernel. Models using the linear kernel had the lowest average AROC on predictions for both supertypes.

Table 3. Performance comparison of SVMs with linear, polynomial and Gaussian kernels on a) HLA-A2 and b) HLA-A3. The tables show AROC values for blind testing; the values where all three kernels performed similarly are shown in bold. [Table entries not recoverable from this transcription.]

5 Discussion and Conclusion

The Gaussian kernel performed best on both HLA-A2 and -A3 supertype predictions, while the linear kernel had the lowest average AROC. This is in contrast to report [4], where the linear kernel showed the best performance. However, the model in [4] was
6 380 Ivana Bozic, Guang Lan Zhang, and Vladiir Brusic trained only for a single HLA olecule (A*020) and their training dataset contained saller nuber of peptides (total of 203). It is possible that linear kernel perfors well on sall datasets, because there ay not be sufficient data for fine-tuning of the boundary. Selection of an optial kernel for prediction of HLA binding peptides is dependant on the training data (both the data set and the natural properties of the studied HLA olecule). In the report [5] each of the three kernels (linear, Gaussian and polynoial) was deterined as optial for soe of the HLA olecules. In this study we report that although Gaussian kernel had the highest average A ROC, all three kernels have coparable perforances for prediction of peptide binding to olecules belonging to a HLA supertype. Possible iproveent ay be achieved by cobining predictions by SVMs that use different kernels. For exaple, iproved HLA-A3 predictions can be achieved by cobining Gaussian (e.g. A*030, 0302, 0, 02) and polynoial (e.g. A*30, 330, 680) kernels. Further experiental data are needed for validation of this hypothesis. Although linear kernel perfored best in soe cases (e.g. A*020 and 680), the iproveent relative to other kernels is iniscule. SVM showed significantly better perforance than HMM and ANN on A3 supertype olecules, but the sae conclusion could not be drawn for A2 supertype. This is probably due to the ibalance of the A2 training data with data for A*020 being significantly larger than any other A2 set. It was reported that SVM often do not have the highest accuracy on ibalanced datasets [4], and dataset for A2 supertype is ibalanced in two ways: binders/non-binders ratio is close to :4, and peptides related to A*020 constitute 6/7; peptides related to all other variants constitute only /7 of the data set. We have developed SVM odels for prediction of peptides that bind olecules belonging to HLA-A2 and -A3 supertypes. 
Predictions were of excellent overall accuracy for both models. In comparison to the other methods for prediction of promiscuous peptides, HMM and ANN, SVMs slightly outperform HMM and ANN on unbalanced data sets (A2), but significantly outperform them on balanced data sets (A3).

Acknowledgements

This project has been funded in part (ZG and VB) with USA Federal funds from the NIAID, NIH, Department of Health and Human Services, under Grant No. 5 U19 AI56541 and Contract No. HHSN C.

References

1. Brusic, V., August, J.T.: The changing field of vaccine development in the genomics era. Pharmacogenomics 5 (2004)
2. Sidney, J. et al.: Practical, biochemical and evolutionary implications of the discovery of HLA class I supermotifs. Immunol. Today 17 (1996)
3. Brusic, V., Bajic, V.B., Petrovsky, N.: Computational methods for prediction of T-cell epitopes - a framework for modelling, testing, and applications. Methods 34 (2004)
4. Zhao, Y. et al.: Application of support vector machines for T-cell epitopes prediction. Bioinformatics 19 (2003)
5. Donnes, P., Elofsson, A.: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 3 (2002) 25
6. Brusic, V. et al.: Prediction of promiscuous peptides that bind HLA class I molecules. Immunol. Cell Biol. 80 (2002)
7. Zhang, G.L. et al.: Neural models for predicting viral vaccine targets. J. Bioinform. Comput. Biol. (in press)
8. Swets, J.: Measuring the accuracy of diagnostic systems. Science 240 (1988)
9. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Berlin (1995)
10. Brusic, V. et al.: MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res. 26 (1998)
11. Chelvanayagam, G.: A roadmap for HLA-A, HLA-B and HLA-C peptide binding specificities. Immunogenetics 45 (1996)
12. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998)
13. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999)
14. Wu, G., Chang, E.Y.: Adaptive feature-space conformal transformation for imbalanced-data learning. Proceedings of the 20th ICML, Washington DC (2003)
More information3D acoustic wave modeling with a time-space domain dispersion-relation-based Finite-difference scheme
P-8 3D acoustic wave odeling with a tie-space doain dispersion-relation-based Finite-difference schee Yang Liu * and rinal K. Sen State Key Laboratory of Petroleu Resource and Prospecting (China University
More informationVariational Adaptive-Newton Method
Variational Adaptive-Newton Method Mohaad Etiyaz Khan Wu Lin Voot Tangkaratt Zuozhu Liu Didrik Nielsen AIP, RIKEN, Tokyo Abstract We present a black-box learning ethod called the Variational Adaptive-Newton
More informationThis model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.
CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationOptimization of ripple filter for pencil beam scanning
Nuclear Science and Techniques 24 (2013) 060404 Optiization of ripple filter for pencil bea scanning YANG Zhaoxia ZHANG Manzhou LI Deing * Shanghai Institute of Applied Physics, Chinese Acadey of Sciences,
More informationOptimizing energy potentials for success in protein tertiary structure prediction Ting-Lan Chiu 1 and Richard A Goldstein 1,2
Research Paper 223 Optiizing energy potentials for success in protein tertiary structure prediction Ting-Lan Chiu 1 and Richard A Goldstein 1,2 Background: Success in solving the protein structure prediction
More informationTopic 5a Introduction to Curve Fitting & Linear Regression
/7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline
More informationRobustness and Regularization of Support Vector Machines
Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA
More informationSymbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm
Acta Polytechnica Hungarica Vol., No., 04 Sybolic Analysis as Universal Tool for Deriving Properties of Non-linear Algoriths Case study of EM Algorith Vladiir Mladenović, Miroslav Lutovac, Dana Porrat
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationIntroduction to Discrete Optimization
Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and
More informationA Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)
1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu
More informationList Scheduling and LPT Oliver Braun (09/05/2017)
List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)
More informationFast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns
Fast Structural Siilarity Search of Noncoding RNs Based on Matched Filtering of Ste Patterns Byung-Jun Yoon Dept. of Electrical Engineering alifornia Institute of Technology Pasadena, 91125, S Eail: bjyoon@caltech.edu
More informationHIGH RESOLUTION NEAR-FIELD MULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR MACHINES
ICONIC 2007 St. Louis, O, USA June 27-29, 2007 HIGH RESOLUTION NEAR-FIELD ULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR ACHINES A. Randazzo,. A. Abou-Khousa 2,.Pastorino, and R. Zoughi
More informationpaper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL
paper prepared for the 1996 PTRC Conference, Septeber 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL Nanne J. van der Zijpp 1 Transportation and Traffic Engineering Section Delft University
More informationThe Algorithms Optimization of Artificial Neural Network Based on Particle Swarm
Send Orders for Reprints to reprints@benthascience.ae The Open Cybernetics & Systeics Journal, 04, 8, 59-54 59 Open Access The Algoriths Optiization of Artificial Neural Network Based on Particle Swar
More informationRecovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)
Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains
More informationQualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science
Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:
More informationNonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy
Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo
More informationDepartment of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China
6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith
More informationA method to determine relative stroke detection efficiencies from multiplicity distributions
A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,
More informationComputational and Statistical Learning Theory
Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationMachine Learning: Fisher s Linear Discriminant. Lecture 05
Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function
More informationNBN Algorithm Introduction Computational Fundamentals. Bogdan M. Wilamoswki Auburn University. Hao Yu Auburn University
NBN Algorith Bogdan M. Wilaoswki Auburn University Hao Yu Auburn University Nicholas Cotton Auburn University. Introduction. -. Coputational Fundaentals - Definition of Basic Concepts in Neural Network
More informationOn Fuzzy Three Level Large Scale Linear Programming Problem
J. Stat. Appl. Pro. 3, No. 3, 307-315 (2014) 307 Journal of Statistics Applications & Probability An International Journal http://dx.doi.org/10.12785/jsap/030302 On Fuzzy Three Level Large Scale Linear
More informationPredicting FTSE 100 Close Price Using Hybrid Model
SAI Intelligent Systes Conference 2015 Noveber 10-11, 2015 London, UK Predicting FTSE 100 Close Price Using Hybrid Model Bashar Al-hnaity, Departent of Electronic and Coputer Engineering, Brunel University
More informationarxiv: v1 [cs.ds] 29 Jan 2012
A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore
More informationDetection and Estimation Theory
ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer
More informationA Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science
A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest
More informationRademacher Complexity Margin Bounds for Learning with a Large Number of Classes
Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute
More informationSimpleMKL. hal , version 1-26 Jan Abstract. Alain Rakotomamonjy LITIS EA 4108 Université de Rouen Saint Etienne du Rouvray, France
SipleMKL Alain Rakotoaonjy LITIS EA 48 Université de Rouen 768 Saint Etienne du Rouvray, France Francis Bach INRIA - Willow project Départeent d Inforatique, Ecole Norale Supérieure 45, Rue d Ul 7523 Paris,
More informationProtein Structure Prediction in a 210-Type Lattice Model: Parameter Optimization in the Genetic Algorithm Using Orthogonal Array
Journal of rotein Cheistry, Vol. 8, No., 999 rotein Structure rediction in a 0-Type Lattice Model: araeter Optiization in the Genetic Algorith Using Orthogonal Array Zhirong Sun, Xiaofeng Xia, Qing Guo,
More informationA Theoretical Analysis of a Warm Start Technique
A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful
More information1 Proof of learning bounds
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a
More informationDetermining OWA Operator Weights by Mean Absolute Deviation Minimization
Deterining OWA Operator Weights by Mean Absolute Deviation Miniization Micha l Majdan 1,2 and W lodziierz Ogryczak 1 1 Institute of Control and Coputation Engineering, Warsaw University of Technology,
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationEstimating Parameters for a Gaussian pdf
Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3
More informationW-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS
W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS. Introduction When it coes to applying econoetric odels to analyze georeferenced data, researchers are well
More information