
Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families: Combining Function Annotation Using Sequence Homology with That Based on Amino Acid k-gram Composition Yields More Accurate Classifiers Than Either of the Individual Approaches

Carson Andorf 1,3, Adrian Silvescu 1,3, Drena Dobbs 2,3,4, Vasant Honavar 1,3,4

1 Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, Iowa, 50010, USA
2 Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa, 50010, USA
3 Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa, 50010, USA
4 Center for Computational Intelligence, Learning, and Discovery

Corresponding author email addresses:
CA: andorfc@cs.iastate.edu
AS: silvescu@cs.iastate.edu
DD: ddobbs@iastate.edu
VH: honavar@cs.iastate.edu

Abstract

Background

Assigning putative functions to novel proteins and discovering how sequence correlates with protein function are important challenges in bioinformatics. We explore several machine learning approaches to data-driven construction of classifiers for assigning protein sequences to appropriate Gene Ontology (GO) functional families using a class conditional probabilistic representation of amino acid sequences. Specifically, we represent protein sequences using a class conditional probability distribution of amino acids (amino acid composition) or of short (k-letter) subsequences of amino acids (k-grams). We compare several sequence-based classifiers for assigning putative functions to proteins: NB k-grams, which ignores the statistical dependencies among overlapping k-grams; NB(k), which models such dependencies; SVM k-grams, a support vector machine (SVM) classifier trained on an amino acid k-gram composition-based representation of protein sequences; an approach that combines the outputs of the classifiers based on amino acid k-gram composition using a two-stage decision tree classifier (DTree); and an approach that combines function annotation based on sequence similarity (obtained using PSI-BLAST) with that based on amino acid k-gram composition using a two-stage decision tree classifier (HDTree).

Results

We report the performance of NB k-grams, NB(k), SVM k-grams, PSI-BLAST, DTree and HDTree classifiers on data sets of three functional families from the Gene Ontology (GO). Each of the proposed methods is effective in correctly assigning GO function categories to protein sequences, even when they share little sequence identity (10% or less) with sequences of known function. The performance of each method is measured in terms of classification accuracy, estimated by cross-validation. Of the four one-stage classifiers, NB k-grams outperformed the others (or tied for best performance) on 11 of the 21 data sets, NB(k) on 5 of the 21, PSI-BLAST on 5 of the 21, and SVM k-grams on 3 of the 21. These results assume that the optimal value of k is known; in practice, the optimal value is not constant across data sets and can only be determined experimentally. This was the motivation for DTree: by combining the outputs of NB k-grams and NB(k) for all values of k, we were able to improve classification performance on 6 of the data sets and obtain more consistent results over all data sets than any classifier with a fixed value of k. These results improve significantly when we add the complementary results of PSI-BLAST. The two-stage decision tree classifier HDTree outperforms DTree, NB k-grams, NB(k), PSI-BLAST and SVM k-grams on a vast majority (18 out of 21) of the data sets; this ratio becomes 15 out of 15 data sets when the sequence identity is in the range of 10% to 90%. We also show how the likelihood ratios of k-gram occurrences for a specific protein functional class can be used to identify amino acid motifs that are reliable predictors of the corresponding functional class. Motifs identified using this approach were, in several cases, shown to correspond to experimentally determined active sites and other functional motifs.

Conclusions

We have shown that amino acid k-gram compositions of protein sequences offer an inexpensive, yet highly effective, source of information for predicting the GO function annotations of proteins. Our experimental results demonstrate the feasibility of using machine learning approaches that require only the amino acid k-gram compositions to automatically and reliably generate GO function annotations of protein sequences, even in cases where the sequence identity between the query sequence and the data set of annotated proteins (training set) is extremely low. Our results show that by combining amino acid k-gram based protein function classifiers with function annotation based on sequence homology (using PSI-BLAST), we can build very strong classifiers whose accuracies reach well over 90%. Our results also suggest that this approach can be extended to the efficient computational identification of potentially functionally significant sequence motifs, without the need for computationally expensive sequence alignment.

Background

Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters and molecular machines in cells. Experimental determination of protein structure and function significantly lags behind the rate of growth of protein sequence databases, and this situation is likely to continue for the foreseeable future. Hence, assigning putative functions to proteins based on sequence alone remains one of the most challenging problems in functional genomics [1]. Improvements in annotating protein sequences can be expected to yield significant improvements in gene annotations. Against this background, there has been a great deal of interest in the development of automated annotation of protein sequences with Gene Ontology (GO) [2] function labels.

One class of sequence-based approaches for functional annotation relies on the comparison of the sequence in question to other sequences in a database of sequences with known function. Functional assignment is made by transference of function whenever sequences are sufficiently similar. A commonly employed notion of similarity is based on estimated sequence homology using programs such as BLAST and its derivatives [3]. Sequence searches often return multiple hits, so significant human expertise is needed to interpret the results, and the reliability of this approach drops rapidly once the pair-wise sequence identity falls below 30 percent [4]. A second class of sequence-based approaches for assigning putative functions to protein sequences relies on the detection of sequence patterns. (Several automated tools for identifying conserved sequence patterns from a given set of sequences, e.g., e-motif and e-matrix [5, 6] and MEME [7], are available.) Motif databases can be queried using a protein sequence to obtain a list of conserved sequence patterns found in the sequence, as well as functions associated with the respective patterns; the results can then be used to assign putative functions to the protein sequence.

In the case of protein families having sufficient numbers of well-characterized members, data mining approaches rooted in statistical inference and machine learning [8] offer an attractive and cost-effective approach to automated construction of classifiers for assigning putative functions to novel protein sequences.
In essence, the data mining approach uses a representative training data set that encodes information about proteins with known functions to build a classifier for assigning proteins to one of the functional families

represented in the training set (and, if necessary, a default class indicating unknown function). The resulting classifier can then be used to assign novel protein sequences to one of the protein families represented in the training set, after it has been validated using an independent test set (which was not used to build the classifier). Recent work by our group [9, 10] has explored the use of machine learning approaches for the automated construction of such classifiers.

In this paper, we explore machine learning approaches for prediction of GO functional categories of protein sequences, with emphasis on methods that utilize only the readily available amino acid sequences of the target proteins. We are especially interested in the effectiveness of such methods in reliably assigning GO functions to protein sequences that share little sequence identity with sequences that have experimentally-validated GO function labels. We are also interested in methods that can not only predict the GO function categories of target proteins, but also help identify putative functional sites, using only primary sequence information.

We compare four different methods that use class conditional probabilities of k-grams (k-letter subsequences) to represent amino acid sequences. The first method uses a Naive Bayes classifier that treats each amino acid sequence as if it were simply a bag of amino acids. The second method (NB k-grams) applies the Naive Bayes classifier to a bag of k-grams (k > 1). Note that NB k-grams violates the Naive Bayes assumption of independence in an obvious fashion: neighbouring k-grams overlap along the sequence, and adjacent k-grams have k-1 elements in common. The third method overcomes this problem by constructing an undirected graphical probabilistic model for k-grams [11, 12], which explicitly models the dependencies among overlapping k-grams in a sequence. We train one such model per functional family. During classification, just as in the case of the Naive Bayes classifier, the sequence to be classified is assigned to the class that has the largest posterior probability given the sequence. We call the resulting classifier NB(k) to denote the fact that it models dependencies among k adjacent elements of sequences. Note that NB(1) is equivalent to NB 1-grams, which in turn is equivalent to the Naive Bayes (NB) classifier. Our fourth method applies a support vector machine (SVM) [13, 14] to classify amino acid sequences represented using class conditional probability distributions of k-grams in the sequence. SVMs have recently been applied successfully to many problems in computational biology, including protein function classification [15], protein subcellular localization [16,17,18,19,20,21,22], and identification of protein-protein interaction sites from sequences [23]. Methods similar to our SVM k-grams method have been independently developed and applied to the task of predicting subcellular localization of proteins [21,22].

We also compare two two-stage hybrid methods that use the outputs of the previous methods as input to a second-stage classifier. Two-stage classifiers have previously been shown to boost performance [18,21,24]. Both of these methods use a simple decision tree algorithm [25, 26] to build classifiers. The first method uses the output from the NB k-grams and NB(k) classifiers for k values of 2, 3, and 4. The second method uses the same input in addition to the output of a PSI-BLAST classifier. We have shown previously [10] that NB k-grams and NB(k) can produce very reliable classifiers.
We showed that neither NB k-grams nor NB(k) consistently outperformed the other. Also, there was no fixed value of k that was optimal for all data sets. By combining the outputs of these methods, we are combining overlapping but complementary information, which increases the flexibility and predictability of our algorithm. PSI-BLAST adds additional complementary information.

The other methods are based on the k-gram composition of proteins; PSI-BLAST is a homology-based tool that uses sequence alignment and does well when proteins have high sequence similarity. The goal of our new methods is to combine the advantages of tools based on a range of k-gram compositions with tools based on sequence homology.

NB, NB(k) and NB k-grams classifiers have the advantage of training with only one pass through the training data and hence lend themselves to incremental updates as new training data become available. SVMs, on the other hand, often achieve higher classification accuracy than is achievable using algorithms that make a single pass through the training data, by optimizing the classifiers to trade off the complexity of the classifiers against accuracy on the training data (a process referred to as regularization in the machine learning literature). The increased accuracy comes at the expense of increased computational requirements: on a large data set, training an SVM classifier typically takes orders of magnitude more time than NB(k) and NB k-grams, ruling out its use in cases where it is necessary to update the classifiers frequently as new training data become available. Hence, it is of interest to compare the performance of SVM k-grams classifiers with computationally less expensive alternatives, such as NB k-grams and NB(k), that lend themselves to incremental updates as new training data become available.

We compare Naive Bayes, NB k-grams, NB(k) and SVM k-grams classifiers with our two-stage classifiers for assigning protein sequences to the corresponding GO (Gene Ontology [2]) functional families. The sequence data sets used in our experiments were extracted from SwissProt [27]. In our experiments, when comparing classifiers using the same value of k, the NB k-grams classifier outperformed (in terms of classification accuracy) the standard Naive Bayes classifier by a large margin (20%-40%), the NB(k) classifier outperformed the NB k-grams classifier by a few percentage points, and SVM k-grams outperformed NB(k) in a large majority of the test cases. It is worth noting that, among the classifiers that performed the best regardless of the value of k, NB k-grams or NB(k) outperformed the other methods on 13 of the 21 test cases, and on 10 of the 15 test cases with sequence similarity ranging from 10% to 90%. Classification performance could be further improved by using the two-stage approach: the overall accuracy improved on 6 of the 21 data sets when DTree was used and on 18 of the 21 data sets when HDTree was used. In some cases the improvement was over 10%.

Results and Evaluation

Data Sets

Data set 1 (Kinase data set) was derived from families of yeast and human kinases. These families were chosen for this study because many of them are well-characterized, with known structures and functions. The data set consists of 288 proteins belonging to the Gene Ontology functional family Protein Kinase Activity. We classified them according to three GO groups just below it in the hierarchy: protein serine/threonine kinase activity (209 proteins), protein-tyrosine kinase activity (69 proteins), and protein threonine/tyrosine kinase activity (10 proteins). Because GO is represented as a directed acyclic graph, some proteins may be represented in multiple classifications.

Data set 2 (Kinase/Ligase data set) is derived from two subfamilies of Catalytic Activity. This division is at a higher level of the GO hierarchy and consists of 376 proteins belonging to two functional families: Protein Kinase Activity (158 proteins) and GO001684, Protein Ligase Activity (218 proteins).

Data set 3 (Kinase/Ligase/Helicase/Isomerase data set) is a superset of the second data set. It contains the kinases and ligases as well as members of two additional subfamilies of Catalytic Activity: Protein Helicase Activity (110 proteins) and Protein Isomerase Activity (86 proteins). This data set enables us to evaluate classifier performance on a larger number of protein classes and at a high level of the GO hierarchy. It includes a total of 572 proteins.

Each data set was filtered to remove proteins that had multiple GO class labels, to ensure that the classes are non-overlapping - a requirement for all of the methods considered in this paper and for most standard machine learning and statistical methods for classification. After the functional classes were extracted from GO, the data sets used in this study were obtained by extracting the corresponding sequences from SwissProt [27].

To examine the effect of sequence identity on the performance of classifiers, we created seven subsets from each of the three larger "functional class" data sets by clustering the sequences in each data set according to percentage sequence similarity among proteins using BLASTCLUST [28]. For example, for an identity cut-off of 50%, protein A belongs to a cluster if and only if there exists a protein B in the same cluster such that proteins A and B have at least 50% sequence identity over 90% of the sequence length; any two proteins not in the same cluster have an identity score of less than 50%. Using this method, six subsets of the original data sets were created using identity scores of 100%, 90%, 70%, 50%, 30%, and 10%. For testing classifiers, one sequence from each cluster was chosen at random as a representative of that cluster and all other proteins in that cluster were removed from the data set. This procedure ensures that no test sequence has a sequence identity greater than the designated cut-off with any other sequence (of the same GO functional class) in the training set.

The seventh subset of each original data set was created with an even more stringent identity criterion: for each data set, we ran PSI-BLAST [28] with the data set against itself. We chose a sequence at random and removed any other sequence belonging to the same class that had a PSI-BLAST hit at an e-value cutoff of 0.0001. This procedure was repeated on the remaining sequences, in each case eliminating from the data set all sequences that yielded a PSI-BLAST hit at that cutoff, and terminating the process when no more sequences could be eliminated. We call the resulting data set unblastable, because PSI-BLASTing any sequence in the subset against any other sequence (of the same class) in the subset is guaranteed not to return a hit. Thus, by clustering the sequences within each of the three larger functional data sets into seven subsets, we generated a total of twenty-one data sets for testing the effect of sequence identity on the performance of classifiers.
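The unblastable filtering step is easy to express procedurally. The sketch below is our minimal Python illustration of the iterative elimination described above; `psiblast_hits` is a hypothetical wrapper around an external PSI-BLAST run (not a real API of the BLAST distribution) that returns the database sequences reported as hits at the given e-value cutoff.

```python
import random

def unblastable_subset(labeled_seqs, evalue_cutoff=1e-4):
    """Keep a random representative, drop same-class sequences it can hit.

    labeled_seqs: list of (sequence, go_class) pairs.
    psiblast_hits(query, database, cutoff) is a hypothetical wrapper
    around an external PSI-BLAST run; it returns the database sequences
    reported as hits at the given e-value cutoff.
    """
    pool = list(labeled_seqs)
    kept = []
    while pool:
        # Choose a sequence at random and keep it.
        seq, cls = pool.pop(random.randrange(len(pool)))
        kept.append((seq, cls))
        # Eliminate every remaining same-class sequence that PSI-BLAST
        # links to the chosen sequence; the loop repeats until no more
        # sequences can be eliminated (i.e., the pool is exhausted).
        same_class_db = [s for s, c in pool if c == cls]
        hits = set(psiblast_hits(seq, same_class_db, evalue_cutoff))
        pool = [(s, c) for s, c in pool if not (c == cls and s in hits)]
    return kept
```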

Experiments

The computational experiments were motivated by the following questions:

a) How do the NB k-grams, NB(k), and SVM k-grams models compare with each other and against the baseline represented by the Naïve Bayes (NB) classifier?
b) How do they compare with PSI-BLAST on the same data sets?
c) Can using a two-stage classifier improve performance?
d) Will building a two-stage classifier based on NB k-grams, NB(k), and PSI-BLAST results further improve performance?
e) What is the effect of k (which can be viewed as a measure of the complexity of the models in question) on classification accuracy?
f) What is the effect of sequence identity (within the data set used to train and test classifiers) on classifier performance?
g) Can this approach provide a computationally efficient method for identifying possible functional motifs?

NB k-grams and NB(k) models were constructed and evaluated on the twenty-one data sets for choices of k from 1 to 4. Values of k larger than 4 were not considered because, at higher values of k, there is insufficient data to obtain reliable probability estimates. The SVM k-grams model, using a linear SVM kernel, was tested with values of k from 1 to 3; higher values of k were not explored because of computational and memory requirements. The reported accuracy estimates are based on stratified 10-fold cross-validation. Within the 10-fold cross-validation experiments, most individual standard deviations for classifiers were under 1% and never exceeded 2%; thus there was little variability among the individual classifiers generated from different folds of the cross-validation experiments. Tables 1, 2 and 3 show comparisons of results obtained with Naïve Bayes, NB k-grams, NB(k), SVM k-grams, PSI-BLAST, and the two-stage classifiers in predicting membership in each of the three GO functional families used in this study.

Results

Classification performance results obtained using the Kinase data set (Data set 1) are shown in Table 1. When no sequence identity cut-off was used (or, equivalently, the identity cut-off was set at 100%), we obtained 63% classification accuracy using PSI-BLAST, 66% accuracy using Naive Bayes alone, and 83% using SVM 1-grams. Increasing the value of k to 2 increased accuracy to 82% for NB 2-grams, 89% for NB(2), and 91% for SVM 2-grams. For NB(2), this represents a 23% improvement over Naive Bayes and a 7% improvement over NB 2-grams, and SVM 2-grams outperformed NB(2) by less than 3%. NB 3-grams and NB(3) had accuracies of 89.9% and 92.0% respectively, with NB(3) (accuracy of 92.0%) outperforming SVM 3-grams (accuracy of 89.6%). All three methods outperformed PSI-BLAST (accuracy of 89.3%). Increasing k to 4 yielded little overall improvement in classification accuracy on this data set: NB 4-grams improved by 3%, and NB(4) had lower accuracy relative to NB(3). This can be explained by the fact that, as k increases, the probability estimates become less and less reliable (as we run out of data).

The classification accuracies for the Kinase data set over a wide range of sequence identity cut-offs are also shown in Table 1.

The NB k-grams, NB(k), and SVM k-grams methods performed well over a wide range of sequence identities. For example, when k=3, NB 3-grams had accuracies ranging from 90% to 87% (a drop of only 3%) when the sequence identity cut-off was reduced from 90% to 10%. Thus, the classifier is able to reliably assign function to query proteins that have very little sequence identity with annotated proteins in the training set. Similarly, NB(3) suffered only a 2.6% drop in accuracy when sequence identity fell from 90% to 10%, and, on the same data sets, accuracy for SVM 3-grams decreased by 7.2%. When sequence identity was less than 90%, the NB k-grams classifier outperformed the SVM k-grams classifier on all 6 subsets of the Kinase data set, and NB(k) outperformed the SVM k-grams classifier on 5 of the 6 subsets (with the SVM outperforming the NB(k) classifier by 0.1% in classification accuracy on the Kinase data set with a sequence identity cut-off of 90%). It is especially worth noting that the classifiers were effective on the most challenging data set that we examined: the unblastable data set, on which a query sequence has no PSI-BLAST hits in the data (making transfer of annotation based on sequence identity using PSI-BLAST impossible). In contrast, NB 3-grams and NB(3) achieved accuracies of 91.9% each, and SVM 2-grams (the best performer among the SVM k-grams classifiers) had an accuracy of 88.7% on the same data set.

Tables 2 and 3 show that NB(k) or NB k-grams classifiers outperform SVM k-grams classifiers on the Kinase/Ligase (Table 2) and Kinase/Ligase/Isomerase/Helicase (Table 3) data sets at sequence identity cut-offs ranging from 10% to 90%. However, when the sequence identity cut-off is set to 100%, SVM k-grams significantly outperforms NB(k), yielding 100% accuracy on the Kinase/Ligase data set and 92.8% accuracy on the Kinase/Ligase/Isomerase/Helicase data set. This represents an improvement of 8.6% and 10.2%, respectively, over the best accuracy of NB(k) on each of the data sets. It is also worth noting that on the unblastable subset of the Kinase/Ligase data set, the SVM 2-grams classifier, with an accuracy of 98.9%, significantly outperforms the NB, NB(k) and NB k-grams classifiers (whose accuracies range from 74.4% to 76.6%).

As expected, PSI-BLAST had relatively high classification accuracy on data sets corresponding to a sequence identity cut-off of 100%. Interestingly, PSI-BLAST outperforms NB k-grams and NB(k) on the Kinase/Ligase (Table 2) and Kinase/Ligase/Isomerase/Helicase (Table 3) data sets, and SVM k-grams on the Kinase/Ligase/Isomerase/Helicase (Table 3) data set. In the case of the data sets on which PSI-BLAST outperforms the other classifiers, almost one third of the sequences (111 out of 376 for the Kinase/Ligase data set and 191 out of 572 for Kinase/Ligase/Isomerase/Helicase) are nearly identical (with sequence identity greater than 90%). However, in the case of the Kinase data set (Table 1) with a sequence identity cut-off of 100%, the accuracy of PSI-BLAST (62.7%) is substantially worse than that of the NB(k) (93% with k=4), NB k-grams (92% with k=2) and SVM k-grams (91.3% with k=3) classifiers. This is not surprising in light of the fact that only 7 out of 288 sequences in the Kinase data set have sequence identity over 90%. However, the classification accuracy obtained by PSI-BLAST decreases dramatically as the sequence identity between the query protein and the annotated sequences in the training set decreases.
In the case of the Kinase/Ligase data set (Table 2), classification accuracy for PSI-BLAST decreased to 82.3% when the sequence identity cut-off was set to 90%. The accuracies of NB k-grams, NB(k), and SVM k-grams also decreased with decreasing sequence identity, but these methods were still able to achieve a classification accuracy of around 89.0% on the Kinase/Ligase data set.

It is especially worth noting that the NB k-grams and NB(k) classifiers outperformed the other classifiers on 11 out of 12 data sets for the Kinase and Kinase/Ligase data (see Tables 1-2) with sequence identity less than 90%, with SVM k-grams beating the others on the remaining data set. It is also worth noting that, in the case of NB k-grams and NB(k), no single choice of k consistently outperforms all other choices of k. This raises the question as to whether it might be possible to further improve the results of such classifiers by using them in combination. To address this question, we constructed a decision tree classifier (DTree) that takes as input the outputs of the NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), and NB(4) classifiers. Two decision trees built on the individual families of classifiers were also explored: one built on only the output of the NB k-grams family of classifiers and another built on only the output of the NB(k) family of classifiers. The accuracies of these decision trees were 1-5% lower than DTree's, so we do not report them in our results. On 6 of the 21 data sets the resulting decision tree classifier outperformed the best overall one-stage classifier from the set (see Tables 1-3). On 17 of the 21 data sets, the decision tree's accuracy was above or within 1% of the best performing classifier from the set, and on 21 of the 21 data sets DTree's accuracy was higher than or within 3% of the best classifier from the set. Thus, our results suggest that it is beneficial to combine our classifiers in a two-stage approach; the result is a much more consistent classifier.

Recall that on the 7 data sets extracted from the Kinase/Ligase/Helicase/Isomerase data, transfer of function annotation based on sequence homology turned out to be the most accurate method for assigning functions to proteins: protein function assignments based on transfer of annotation from the top PSI-BLAST hits on the training data were more accurate than the function assignments produced by the other (amino acid k-gram composition based) methods on 6 of the 7 data sets. In light of these experimental results, it is natural to ask whether there is some benefit to be gained by combining a sequence homology based tool such as PSI-BLAST with classifiers trained on amino acid k-gram representations of protein sequences, such as NB(k) or SVM k-grams. Recall also that the DTree classifier was able to improve on the individual results on some of the data sets and to stay within 3% of the best individual classifier from the family of classifiers trained on amino acid k-gram representations on all of the data sets. To answer this question, we constructed a decision tree classifier (HDTree) that takes as input, in addition to the inputs used by DTree, the function assignment based on sequence homology (obtained by running PSI-BLAST on the training set). Our experiments show that HDTree outperforms DTree: classifiers generated by HDTree outperform all the other methods on 18 of the 21 data sets. In the case of data sets with sequence identity cut-offs ranging from 10% to 90%, HDTree's overall accuracy was superior to that of the other methods on 15 of the 15 data sets (100%) (see Tables 1-3). On a majority of the data sets this improvement was also significant.
On the Kinase data sets, the improvement in accuracy ranged from 2.1% (in the case of the 100% sequence identity cut-off) to 4.2% (when the sequence identity cut-off was set to 10%) over the NB k-grams and NB(k) classifiers, and from 5.8% to 8.4% over the PSI-BLAST results. In the case of the Kinase/Ligase data set, the improvement in accuracy was over 8% for each of the data sets with sequence identity cut-offs ranging from 10% to

90% relative to the NB k-grams and NB(k) classifiers, and over 15% relative to the PSI-BLAST results. On the Kinase/Ligase/Isomerase/Helicase data sets, the improvement in accuracy ranged from 13% to 17% over the NB k-grams and NB(k) classifiers, and from 1% (100% sequence identity cut-off) to 12% (30% sequence identity cut-off) over the PSI-BLAST results. In summary, the HDTree classifier, which uses both amino acid k-gram composition and sequence homology to assign putative functions to proteins, had the best overall classification accuracy of all the methods.

Discussion

There has been some previous work using k-gram composition for protein sequence classification, including sequence-based assignment of putative functions to proteins. Most of the focus has been on using amino acid composition (1-grams) [18,29,30,31,32], while other work has focused on using dipeptide composition (2-grams) [18,22,33,34,35] to predict protein subcellular localization. An SVM with a spectrum kernel to handle k-grams with k>2 has been reported [36], and k-grams have been used with a Naïve Bayes model for text classification problems [12,37]. Recently, a discriminatively trained version of the NB(k) classifier for sequence classification has been proposed [38]. Methods similar to our SVM k-grams have been independently developed and applied to the task of predicting subcellular localization of proteins [20,21]. In contrast, here we focus on prediction of GO functional labels.

In the data sets used in this study, the class labels are mutually exclusive. However, many proteins are multi-functional. The development of effective methods for classification of data that are labelled with multiple, not necessarily mutually exclusive, class labels or hierarchically structured class labels is largely an open problem in machine learning, although some methods have recently been proposed [39,40,41]. Against this background, it would be interesting to extend the approaches explored in this paper to deal with hierarchically structured class labels. Several authors have recently explored the use of protein-protein interaction data [42, 43], gene expression data [44], and protein structural features [45] to develop methods for assigning putative GO function labels to proteins of unknown function. Against this background, systematic assessment of the utility of different types of information (relative to their cost) for automated GO function annotation of proteins represents an important future direction.

Implications for Automated Sequence-Based GO Function Annotation

Our results confirm the usefulness of classifiers that use a class conditional probabilistic representation of amino acid sequences to predict GO functional families. It can also be useful to use PSI-BLAST for transfer of functional annotation to a query sequence based on sequence identity to proteins with known annotations, when the level of sequence identity between the two is rather high. However, the accuracy of function annotations produced by PSI-BLAST can drop rapidly with decreasing sequence identity between the query sequence and the training set.

In contrast, machine learning methods that utilize amino acid k-gram compositions provide accurate functional annotations whenever proteins share a similar k-gram composition. NB k-grams and NB(k) outperform Naive Bayes in our experiments. In terms of accuracy, NB(k) and NB k-grams are very complementary: there are many instances in our test cases where each outperformed the other. Both of these machine learning methods consistently outperform PSI-BLAST on 2 of the 3 GO functional families, but PSI-BLAST significantly outperformed the two machine learning algorithms on the remaining GO functional family. By combining the results of NB(k), NB k-grams, and PSI-BLAST in a two-stage hybrid approach, we are able to take advantage of detecting high sequence similarity and similar k-gram composition simultaneously in one unified classifier.

NB k-grams and NB(k) require only one pass through the data, which makes the resulting classifiers easy to construct and to update as new data become available. In contrast, at present, there are no efficient algorithms for updating SVM classifiers to incorporate new data in an incremental fashion. This makes NB(k) an attractive alternative when using large data sets or data sets that are rapidly being updated or modified. PSI-BLAST can also be used incrementally: after an initial index has been built, a sequence can be queried against the index, storing the top-scoring hit and its e-value. As new data appear, the sequence is queried against the new data and the result is compared to the previously stored top-scoring e-value; if the new hit has a smaller e-value, it replaces the stored top-scoring hit, and otherwise it is discarded. Finally, the second-stage classifier is based on a simple decision tree algorithm [25, 26]. The input to this classifier is only seven attributes, and the resulting classifier can be built in seconds on a standard 32-bit machine. Therefore, our method is an incremental process that requires minimal computational effort, yet works very well for automated GO function annotation of protein sequences.

Detecting Potential Functionally Significant Motifs from the Learned Classifiers

In addition to predicting functional labels, the likelihood ratios based on the k-gram probabilities given a specific class can be used to identify specific motifs in protein sequences that may be significant for function. In several cases, we have noted that specific residues with top-ranking likelihood ratios correspond to positions in active sites or other functional motifs that have been previously identified by biochemical and genetic approaches. For example, in the kinases, analysis of the likelihood ratios produced by the learned classifiers allowed us to identify the active site motif, HRDL, along with the functional motifs APE and DFG [46,47]. A fourth motif identified by the learned classifier, DIWSL and DVWSL, has also been experimentally determined to be a functional motif for kinases [48]. This fourth motif is located in close proximity to the three verified functional motifs within the folded protein structure. An example is shown in Figure 1a, in which all four motifs are mapped onto the three-dimensional structure of a representative kinase, the lymphocyte-specific kinase Lck [PDB: 1QPC] [49]. These regions were experimentally determined to be important in Lck kinase function [46,47,48].
In all three of our examples, the DIWSL region was within close enough proximity to form contact regions with the active site HRDL and the functional motif APE.

In fact, the DIWSL motif forms several contacts (i.e., several amino acids in each motif have Cα carbons within 4 Å of each other) with the HRDL and APE motifs. These motifs are mapped onto the structures of two other protein kinases, human cyclin-dependent kinase 2 [PDB: 1B38] and human serine/threonine kinase Pak1 [PDB: 1F3M], in Figures 1b and 1c, showing that this relationship is conserved in other kinase family members, as expected. An important advantage of this potential method for identifying functional protein sequence motifs is its lack of reliance on computationally expensive multiple sequence alignment. Additional studies are needed to evaluate the broader applicability of this proposed method for rapid sequence-based identification of functionally or structurally significant motifs in proteins.

Future Directions

Some directions for future work include:

a) Further evaluation of the methods described here on a broader range of data sets;
b) Direct comparison of the performance of the sequence-based methods described here with methods that utilize structural information for query proteins (e.g., on cases drawn from structural genomics targets);
c) Development of principled approaches to assigning a protein sequence simultaneously to multiple classes (in the case of multifunctional proteins);
d) Assessment of the relative utility of other sources of information (e.g., expression data, interaction data, structural features) [42,43,44,45] for improving the accuracy of automated function annotation;
e) Examination of the resulting classifiers to identify testable hypotheses concerning sequence correlates of protein function and to guide the design of experiments to validate such hypotheses.

Conclusions

The results presented in this paper show that amino acid k-gram compositions of sequences offer an inexpensive, yet highly effective, source of information for GO function annotation of protein sequences. Our results demonstrate the feasibility of developing fully automated and computationally efficient sequence-based approaches to functional annotation of proteins, even when they share very little sequence identity with previously annotated sequences. According to our results, this information is complementary to sequence homology and can be combined with PSI-BLAST results to yield a flexible and powerful classifier that works well on a variety of data. Our results also suggest the possibility of identifying potentially functionally significant sequence motifs without performing computationally expensive sequence alignment.

Methods

Classification Using a Probabilistic Model

Before outlining the two probabilistic models used for modelling the interactions among k consecutive elements in a sequence, we define a method for building a classifier associated with a probabilistic model. Suppose we have a probabilistic model $\alpha$ for sequences defined over some alphabet (which in our case is the 20-letter amino acid alphabet). The model $\alpha$ specifies, for any sequence $S = s_1, \ldots, s_n$, the probability $P_\alpha(S = s_1, \ldots, s_n)$. A classifier can be built from the probabilistic model using the following procedure: for each class $c_j$, train a probabilistic model $\alpha(c_j)$ using the sequences belonging to $c_j$, and predict the classification $c(S)$ of a novel sequence $S = s_1, \ldots, s_n$ as:

$$c(S) = \arg\max_{c_j \in C} P_{\alpha(c_j)}(S = s_1, \ldots, s_n) \, P(c_j)$$

Note that $P_\alpha(S = s_1, \ldots, s_n \mid c_j) = P_{\alpha(c_j)}(S = s_1, \ldots, s_n)$; therefore:

$$c(S) = \arg\max_{c_j \in C} P_\alpha(S = s_1, \ldots, s_n \mid c_j) \, P(c_j)$$

Naïve Bayes Classifier

The Naïve Bayes classifier assumes that each element of the sequence is independent of the other elements given the class label. Consequently,

$$c(S) = \arg\max_{c_j \in C} \left[ \prod_{i=1}^{n} P_\alpha(s_i \mid c_j) \right] P(c_j)$$

Note that the Naive Bayes classifier for sequences treats each sequence as though it were simply a bag of letters. We now consider two Naive Bayes-like models based on k-grams.

Naïve Bayes k-grams Classifier

The Naive Bayes k-grams (NB k-grams) method uses a sliding window of size k along each sequence to generate a bag-of-k-grams representation of the sequence. Much as in the case of the Naive Bayes classifier described above, it treats each k-gram in the bag as independent of the others given the class label for the sequence. Given this probabilistic model, the previously outlined method for classification can be applied. The classification rule associated with Naïve Bayes k-grams is:

$$c(S = [S_1 = s_1, \ldots, S_n = s_n]) = \arg\max_{c_j \in C} \left[ \prod_{i=1}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-1} = s_{i+k-1} \mid c_j) \right] P(c_j)$$

A problem with the NB k-grams approach is that successive k-grams extracted from a sequence share k-1 elements in common. This grossly and systematically violates the independence assumption of Naive Bayes.
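As a concrete illustration of the bag-of-k-grams model, the following minimal Python sketch implements an NB k-grams classifier with Laplace-smoothed k-gram probabilities. It is our illustration under the definitions above, not the authors' original implementation (which was written in Java); all class and function names are ours.

```python
from collections import Counter
from math import log

def kgrams(seq, k):
    # Sliding window of size k: successive k-grams overlap in k-1 positions.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

class NBKgrams:
    """Naive Bayes over a bag of k-grams with Laplace-smoothed estimates."""

    def __init__(self, k=3, alphabet_size=20):
        self.k = k
        self.num_kgrams = alphabet_size ** k  # e.g. 8000 possible trimers
        self.counts = {}   # class -> Counter of k-gram occurrences
        self.totals = {}   # class -> total number of k-grams seen
        self.priors = {}   # class -> P(class)

    def fit(self, seqs, labels):
        for s, c in zip(seqs, labels):
            self.counts.setdefault(c, Counter()).update(kgrams(s, self.k))
        for c, ctr in self.counts.items():
            self.totals[c] = sum(ctr.values())
            self.priors[c] = labels.count(c) / len(labels)

    def log_prob(self, gram, c):
        # Laplace estimator: (count + 1) / (total + |alphabet|^k).
        return log((self.counts[c][gram] + 1) /
                   (self.totals[c] + self.num_kgrams))

    def predict(self, seq):
        # Assign the class with the largest (log) posterior, treating the
        # overlapping k-grams as independent given the class.
        return max(self.counts,
                   key=lambda c: log(self.priors[c]) +
                   sum(self.log_prob(g, c) for g in kgrams(seq, self.k)))
```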

Naïve Bayes (k)

We introduce the Naive Bayes (k), or NB(k), model to explicitly model the dependencies that arise as a consequence of the overlap between successive k-grams in a sequence. Figure 2a shows the dependency model for a sequence of 5 elements. We represent the dependencies in graphical form by drawing edges between the elements that are directly dependent on each other. The graph for pairwise dependencies is illustrated in Figure 2b and the one for 3-way dependencies is depicted in Figure 2c. Using the Junction Tree Theorem for graphical models [50], it can be proved [51] that the correct probability model $\alpha$ that captures the dependencies among overlapping k-grams is given by:

$$P_\alpha(S = [S_1 = s_1, \ldots, S_n = s_n]) = \frac{\prod_{i=1}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-1} = s_{i+k-1})}{\prod_{i=2}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-2} = s_{i+k-2})}$$

Given this probabilistic model, we can use the standard approach to classification described above. It is easily seen that when k = 1, Naive Bayes 1-grams as well as Naive Bayes (1) reduce to the Naive Bayes model. The relevant probabilities required for specifying the above models can be estimated using standard techniques for the estimation of probabilities with Laplace estimators [52].

SVM k-grams

Note that the NB(k) algorithm was developed because NB k-grams systematically violates the independence assumption of Naïve Bayes. Against this background, it is of interest to consider other methods that can utilize k-gram frequencies without relying on the independence assumptions made by NB k-grams and without the need for explicit modelling of dependencies as in the case of NB(k). Hence, we consider a Support Vector Machine (SVM) classifier [13,14] that accepts as input a k-gram probability distribution for a protein and outputs a class label. For our experiments we used the SMO algorithm implemented in Weka [26].

PSI-BLAST

As an additional benchmark to test the performance of our methods, we used PSI-BLAST (version 2.2.9) [28]. PSI-BLAST compares an amino acid query sequence against a protein sequence database. For a given data set, we chose one sequence to use as a test sequence; the remaining sequences in the data set were used as a training database. Using PSI-BLAST, we blasted the test sequence against the training database. If the top hit (the sequence with the lowest e-value) in the PSI-BLAST results has the same class as the test query sequence, the query sequence is scored as a true classification. Otherwise, if the top hit has a different class or no hit is reported at all, the query sequence is scored as a false classification. This is repeated for all sequences in the given data set. An e-value of 0.0001 was used for PSI-BLAST, with all other parameters set to their default values.
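The PSI-BLAST benchmark just described amounts to a leave-one-out evaluation. Below is our minimal Python sketch of that loop; `top_psiblast_hit` is a hypothetical wrapper around an external PSI-BLAST run (not part of the BLAST distribution's API) that returns the index of the lowest e-value hit in the database, or None when no hit is reported.

```python
def psiblast_accuracy(labeled_seqs, evalue_cutoff=1e-4):
    # labeled_seqs: list of (sequence, go_class) pairs.
    correct = 0
    for i, (query, cls) in enumerate(labeled_seqs):
        # Hold the query out; the rest of the data set is the database.
        database = [s for j, (s, _) in enumerate(labeled_seqs) if j != i]
        db_labels = [c for j, (_, c) in enumerate(labeled_seqs) if j != i]
        top = top_psiblast_hit(query, database, evalue_cutoff)
        # No hit at all, or a top hit from a different class, is a miss.
        if top is not None and db_labels[top] == cls:
            correct += 1
    return correct / len(labeled_seqs)
```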

DTree Method

The DTree approach uses the outputs of our NB k-grams and NB(k) algorithms as the data representation. Each of these algorithms outputs a discrete value mapping back to the class list: if there are four classes, the output belongs to {0,1,2,3}, where each value corresponds to a class. Since there are six classifiers (NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), and NB(4)), the data representation is simply a 6-dimensional vector of the 6 outputs of these classifiers. This 6-dimensional vector is then used as input to a decision tree algorithm. For these experiments we used the commonly used decision tree algorithm C4.5 [25], implemented as the J4.8 algorithm in Weka [26].

HDTree Method

The HDTree approach uses the outputs of our NB k-grams and NB(k) algorithms, together with the output of the PSI-BLAST classifier, as the data representation. Each of these algorithms outputs a discrete value mapping back to the class list, as above. Since there are seven classifiers (NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), NB(4), and PSI-BLAST), the data representation is a 7-dimensional vector of the 7 outputs of these classifiers. This 7-dimensional vector is then used as input to a decision tree algorithm, again C4.5 [25] as implemented in the J4.8 algorithm in Weka [26]. A sketch of this construction is given below.
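The following minimal Python sketch illustrates the two-stage construction; scikit-learn's DecisionTreeClassifier stands in here for Weka's J4.8, and all names are ours. It assumes each first-stage classifier exposes a predict(sequence) method returning an integer class index, as described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka J4.8 (C4.5)

def stacked_features(seqs, base_classifiers, psiblast_predict=None):
    # Each first-stage classifier maps a sequence to an integer class index;
    # the second stage sees only this vector of first-stage predictions.
    rows = []
    for s in seqs:
        row = [clf.predict(s) for clf in base_classifiers]   # 6 NB outputs
        if psiblast_predict is not None:
            row.append(psiblast_predict(s))                  # 7th attribute (HDTree)
        rows.append(row)
    return np.array(rows)

# Training sketch: nb_models would hold the fitted NB 2/3/4-gram and
# NB(2)/NB(3)/NB(4) classifiers (hypothetical names on our part).
# X = stacked_features(train_seqs, nb_models, psiblast_predict)  # HDTree input
# second_stage = DecisionTreeClassifier().fit(X, train_labels)
# y_hat = second_stage.predict(
#     stacked_features(test_seqs, nb_models, psiblast_predict))
```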

Motif Detection

We hypothesize that the likelihood ratios based on the k-gram probabilities given a specific class can be used to identify specific motifs in sequences that may be important for protein function. Based on this, we propose the following procedure.

First, only k-grams consisting of amino acids that are not independent given the class are identified, as follows. If

$$P(k\text{-gram} \mid \text{class}) = \prod_{i=1}^{k} P(1\text{-gram}_i \mid \text{class})$$

then the individual amino acids are independent given the class. Because we are interested in k-grams consisting of amino acids that are not independent given the class, we can perform the test given by

$$\frac{P(k\text{-gram} \mid \text{class})}{\prod_{i=1}^{k} P(1\text{-gram}_i \mid \text{class})} \ge \varphi_1$$

When this ratio is close to 1, the amino acids are (approximately) independent given the class; when it is greater than 1, the amino acids are dependent given the class. For example, the following test can be used to select the 3-grams (trimers) of interest for the class kinase:

$$\frac{P(trimer_{ijk} \mid kinase)}{P(monomer_i \mid kinase)\, P(monomer_j \mid kinase)\, P(monomer_k \mid kinase)} \ge \varphi_1$$

where $trimer_{ijk}$ is the trimer defined by the i-th amino acid (of the possible 20 amino acids) in position 1 of the trimer, the j-th amino acid in position 2, and the k-th amino acid in position 3. The trimer belongs to one of the 8000 possible 3-gram combinations of the 20-letter amino acid alphabet, and $\varphi_1$ is a cut-off value. For our study we empirically determined the most useful value of $\varphi_1$ to be 3.5.

Among the k-grams selected using the test described above, we are interested in k-grams that occur more often in a given class relative to the entire data set. The likelihood ratio test is:

$$\frac{P(k\text{-gram} \mid \text{class})}{P(k\text{-gram} \mid \text{observed})} \ge \varphi_2$$

Thus, we can identify the k-gram motifs associated with the kinase data set (for k=3) using the test:

$$\frac{P(trimer_{ijk} \mid kinase)}{P(trimer_{ijk} \mid SwissProt)} \ge \varphi_2$$

where $trimer_{ijk}$ is defined as above. We calculated the observed probabilities using counts from all the protein sequences found in SwissProt (over 170,000 sequences). The greater the value of $\varphi_2$, the more likely the k-gram is to occur in the given class relative to SwissProt as a whole. For this study, we empirically determined the most useful value of $\varphi_2$ to be 3.5.

To determine whether k-gram regions were within close proximity of each other, we used the graphical contacts tool provided by the Diamond Sting Millennium software package [53].

Authors' contributions

CA conceived of and designed the study, carried out the data analysis and visualization, developed the Java computer code, and drafted the manuscript. AS contributed to algorithm development. DD and VH contributed to the design of the study, the analysis and interpretation of results, and the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by grants from the National Science Foundation ( ) and the National Institutes of Health (GM066387) to Vasant Honavar and Drena Dobbs. Carson Andorf has been supported in part by a fellowship funded by an Integrative Graduate Education and Research Training (IGERT) award ( ) from the National Science Foundation. The authors wish to thank members of their research group, especially Oksana Yakhnenko and Cornelia Caragea, for helpful comments on drafts of this paper.

References

1. Eisenberg D, Marcotte E, Xenarios I, and Yeates T. Protein function in the post-genomic era. Nature. 2000, 405(6788).
2. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genet. 2000, (25).
3. Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990, 215.
4. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999, 12(2).
5. Huang J and Brutlag D. The emotif database. Nucleic Acids Res. 2001, 29(1).
6. Ben-Hur A and Brutlag D. Remote homology detection: a motif based approach. Bioinformatics. 2003, 19 Suppl 1.
7. Bailey T, Baker M, Elkan C, and Grundy W. Meme, mast, and meta-meme: New tools for motif discovery in protein sequences. In: Pattern Discovery in Biomolecular Data. Oxford: Oxford University Press, 1999.
8. Baldi P and Brunak S. Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press.
9. Wang X, Schroeder D, Dobbs D, and Honavar V. Automated data-driven discovery of protein function classifiers. Information Sciences.
10. Andorf C, Dobbs D, and Honavar V. Discovering protein function classification rules from reduced alphabet representations of protein sequences. In: Proceedings of the Conference on Computational Biology and Genome Informatics.
11. Charniak E. Statistical Language Learning. Cambridge: MIT Press.
12. Peng F and Schuurmans D. Combining naive Bayes and n-gram language models for text classification. In: Twenty-Fifth European Conference on Information Retrieval Research (ECIR-03).
13. Boser B, Guyon I, and Vapnik V. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA. ACM Press, 1992.
14. Vapnik V. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.
15. Al-Shahib A, Breitling R, Gilbert D. Feature selection and the class imbalance problem in predicting protein function from sequence. Appl Bioinformatics. 2005, 4.
16. Lanckriet G, Cristianini N, Jordan M, and Noble W. Kernel-based integration of genomic data using semidefinite programming. In: Kernel Methods in Computational Biology. Edited by Schoelkopf B, Tsuda K and Vert JP. Cambridge, MA: MIT Press.

17. Sarda D, Chua GH, Li KB, Krishnan A. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics. 2005, 6.
18. Bhasin M, Garg A, Raghava GP. PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics. 2005, 21(10).
19. Nair R, Rost B. Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 2005, 348.
20. Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem. 2005, 280(15).
21. Hua S and Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001, 17.
22. Bhasin M and Raghava G. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res.
23. Yan C, Dobbs D, Honavar V. A two-stage classifier for identification of protein-protein interface residues. Bioinformatics.
24. Atalay V, Cetin-Atalay R. Implicit motif distribution based hybrid computational kernel for sequence classification. Bioinformatics. 2005, 21.
25. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann.
26. Witten I and Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. San Francisco: Morgan Kaufmann.
27. Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin M, Michoud K, O'Donovan C, Phan I, Pilbout S, and Schneider M. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31.
28. Altschul S, Madden T, Schaffer A, Zhang J, Miller W, and Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25(17).
29. Wang M, Yang J, Liu GP, Xu ZJ, Chou KC. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng Des Sel. 17(6).
30. Cai YD, Chou KC. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Commun. 2003, 305(2).
31. Cai YD, Chou KC. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics. 2004, 20.
32. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FS. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003, 31.

33. Raghava GP, Han JH. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics. 2005, 6.
34. Lanckriet G, Cristianini N, Jordan M, and Noble W. Kernel-based integration of genomic data using semidefinite programming. In: Kernel Methods in Computational Biology. Edited by Schoelkopf B, Tsuda K and Vert JP. Cambridge, MA: MIT Press.
35. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19.
36. Leslie C, Eskin E, and Noble W. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In: Proceedings of the Pacific Symposium on Biocomputing 2002, January 2-7.
37. Yee B, Cheng M, Carbonell J, Klein-Seetharaman J. Protein classification based on text document classification techniques. Proteins: Structure, Function, and Bioinformatics. 2005, 58(4).
38. Yakhnenko O, Silvescu A, and Honavar V. Discriminatively Trained Markov Model for Sequence Classification. In: Proceedings of the IEEE Conference on Data Mining (ICDM 2005). IEEE Press. In press.
39. Kriegel HP, Kroeger P, Pryakhin A, and Schubert M. Using Support Vector Machines for Classifying Large Sets of Multi-Represented Objects. In: Proceedings of the 4th SIAM Int. Conf. on Data Mining, 2004.
40. Clare A and King RD. Machine learning of functional class from phenotype data. Bioinformatics. 2002, 18.
41. Wu F, Zhang J, and Honavar V. Learning Classifiers Using Hierarchically Structured Class Taxonomies. In: Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA 2005), Edinburgh. Berlin: Springer-Verlag. In press.
42. Letovsky S and Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics. 2003, 19(Suppl 1).
43. Deng M, Tu Z, Sun F, Chen T. Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics. 2004, 20.
44. Lagreid A, Hvidsten TR, Midelfart H, Komorowski J, and Sandvik AK. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns. Genome Res. 2003, 13(5).
45. Hayete B and Bienkowska JR. GOTrees: predicting GO associations from protein domain composition using decision trees. Pac Symp Biocomput. 2005.
46. Prince T and Matts RL. Definition of Protein Kinase Sequence Motifs That Trigger High Affinity Binding of Hsp90 and Cdc37. J. Biol. Chem. 2004, 279.
47. Li K, Zhao S, Karur V, and Wojchowski DM. DYRK3 Activation, Engagement of Protein Kinase A/cAMP Response Element-binding Protein, and Modulation of Progenitor Cell Survival. J. Biol. Chem. 2002, 277(49).

48. Kung HJ, Chen HC, Robinson D. Molecular Profiling of Tyrosine Kinases in Normal and Cancer Cells. J Biomed Sci. 1998, 5.
49. Zhu X, Kim JL, Rose PE, Stover DR, Toledo LM, Zhao H, Morgenstern KA. Structural Analysis of the Lymphocyte-Specific Kinase Lck in Complex with Non-Selective and Src Family Selective Kinase Inhibitors. Structure (London). 1999, 7.
50. Cowell R, Dawid A, Lauritzen S, and Spiegelhalter D. Probabilistic Networks and Expert Systems. Springer.
51. Silvescu A, Andorf C, Dobbs D, and Honavar V. Inter-element dependency models for sequence classification. Technical report, Department of Computer Science, Iowa State University.
52. Mitchell T. Machine Learning. New York, USA: McGraw Hill.
53. Neshich G, Mancini AL, Yamagishi ME, Kuser PR, Fileto, et al. STING Report: convenient web-based application for graphic and tabular presentations of protein sequence, structure and function descriptors from the STING database. Nucleic Acids Res. 2005, 33, Database Issue: D269-D274.

Figures

Figure 1 - Kinase Protein Structures with Highlighted Functional Motif Candidates

Structures of three proteins, the lymphocyte-specific kinase Lck [PDB: 1QPC], human cyclin-dependent kinase 2 [PDB: 1B38], and human serine/threonine kinase Pak1 [PDB: 1F3M], are shown with four candidate functional motifs (identified by likelihood and independence ratios) highlighted. The functional motif MAPE is labelled 1 (blue); the motif identified by our method, DVWS or DIWSL, is labelled 2 (green); the active site motif HRDL is labelled 3 (red); and the functional motif DFG is labelled 4 (orange). Potential non-covalent bonds among predicted motifs are shown in a contact map in the box below each structure. Using the distances between individual residues within motifs and the geometric relationships among atoms in these residues in the three-dimensional structure, possible bonds (Cα distance less than 4 Å) could be formed between residues in different motifs. Residues are represented by circles labelled with the corresponding amino acid symbol, and possible non-covalent contact bonds are represented by lines between two residues. Possible contacts can be formed between the regions HRDL and DFG; HRDL and MAPE; HRDL and DIWSL; and DIWSL and MAPE. The predicted bonds were determined using the Sting Millennium Package [53].

Figure 2 - Undirected Graphical Models

Graphical depiction of the dependence between the elements in a sequence of five elements using undirected graphical models (for protein data, nodes represent amino acids and edges represent dependencies between amino acids): a) Naïve Bayes, b) pairwise dependence (k=2), and c) 3-way dependence (k=3).
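For readers who want to reproduce the contact criterion used in the Figure 1 legend (Cα-Cα distance under 4 Å) without the STING package, the following minimal numpy sketch applies the same criterion; it assumes the Cα coordinates of the motif residues have already been parsed from the PDB entries, and the data layout is ours.

```python
import numpy as np

def motif_contacts(motif_a, motif_b, cutoff=4.0):
    # motif_a, motif_b: lists of (residue_label, ca_xyz) pairs, where ca_xyz
    # is the (x, y, z) coordinate of the residue's Calpha atom from the PDB file.
    # Returns the residue pairs whose Calpha atoms lie within `cutoff` angstroms.
    pairs = []
    for label_a, xyz_a in motif_a:
        for label_b, xyz_b in motif_b:
            dist = np.linalg.norm(np.asarray(xyz_a) - np.asarray(xyz_b))
            if dist < cutoff:
                pairs.append((label_a, label_b, float(dist)))
    return pairs
```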

Tables

Table 1 - Kinase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements.
[Table values not preserved in this transcription. Columns: percent identity bins plus an UnBLASTable column; rows: Size, Naïve Bayes, NB 1-gram, NB 2-gram, NB 3-gram, NB 4-gram, NB(1), NB(2), NB(3), NB(4), SVM 1-gram, SVM 2-gram, SVM 3-gram, PSI-BLAST (N/A for UnBLASTable), DTree, HDTree.]

Table 2 - Kinase/Ligase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase/Ligase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements.
[Table values not preserved in this transcription; same rows and columns as Table 1.]

Table 3 - Kinase/Ligase/Helicase/Isomerase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase/Ligase/Helicase/Isomerase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements. A sketch of this style of cross-validation estimate follows the list of additional files below.
[Table values not preserved in this transcription; same rows and columns as Table 1.]

Additional files

Additional file 1 - Figure1.jpg
Additional file 2 - Figure2.jpg
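The accuracies in Tables 1-3 are cross-validation estimates. The snippet below is a small sketch of that style of estimate, not our experimental pipeline: it cross-validates a multinomial Naïve Bayes classifier over amino acid 3-gram composition using scikit-learn. The sequences, labels, and fold count are toy values chosen only so the example runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical protein sequences and binary GO class labels.
seqs = ["MAPELKVHRDLDFG", "ACDEFGHIKLMNPQ", "MAPEIDVWSLHRDL", "LMNPQRSTVWYACD"]
labels = [1, 0, 1, 0]

# Represent each sequence by its 3-gram (k-gram, k = 3) composition.
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vec.fit_transform(seqs)

# Accuracy estimated by cross-validation (2 folds only because the toy
# data set is tiny; the tables above use much larger data sets).
acc = cross_val_score(MultinomialNB(), X, labels, cv=2, scoring="accuracy")
print(acc.mean())
```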

Figure 1
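The contact maps accompanying Figure 1 rest on a simple geometric test: residues in different motifs are flagged as possible contacts when their Cα atoms lie within 4 Å. The sketch below shows one way to run that test with Biopython; it is an illustration rather than the procedure used for the figure (the Sting Millennium Package [33] was used there), and the motif residue numbers for Lck, the chain identifier, and the local PDB file path are hypothetical placeholders.

```python
from itertools import product
from Bio.PDB import PDBParser

# Hypothetical residue ranges for two motifs in chain A of Lck [PDB: 1QPC];
# the real positions would come from mapping the motifs onto the structure.
MOTIF_HRDL = range(154, 158)
MOTIF_DFG = range(172, 175)

# Assumes 1qpc.pdb has been downloaded locally from the PDB.
structure = PDBParser(QUIET=True).get_structure("1QPC", "1qpc.pdb")
chain = structure[0]["A"]

# Flag residue pairs whose C-alpha atoms lie within the 4 Angstrom cutoff.
for i, j in product(MOTIF_HRDL, MOTIF_DFG):
    try:
        dist = chain[i]["CA"] - chain[j]["CA"]  # Bio.PDB: '-' gives distance
    except KeyError:  # residue or CA atom missing from the model
        continue
    if dist < 4.0:
        print(f"possible contact: {i} -- {j} ({dist:.2f} A)")
```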
