
Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families: Combining Function Annotation Using Sequence Homology with That Based on Amino Acid k-gram Composition Yields More Accurate Classifiers Than Either of the Individual Approaches

Carson Andorf 1,3, Adrian Silvescu 1,3, Drena Dobbs 2,3,4, Vasant Honavar 1,3,4

1 Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, Iowa, 50010, USA
2 Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa, 50010, USA
3 Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa, 50010, USA
4 Center for Computational Intelligence, Learning, and Discovery

Corresponding author email addresses:
CA: andorfc@cs.iastate.edu
AS: silvescu@cs.iastate.edu
DD: ddobbs@iastate.edu
VH: honavar@cs.iastate.edu

Abstract

Background

Assigning putative functions to novel proteins and discovering how sequence correlates with protein function are important challenges in bioinformatics. We explore several machine learning approaches to data-driven construction of classifiers for assigning protein sequences to appropriate Gene Ontology (GO) functional families using a class conditional probabilistic representation of amino acid sequences. Specifically, we represent protein sequences using a class conditional probability distribution of amino acids (amino acid composition) or of short (k-letter) subsequences of amino acids (k-grams). We compare several sequence-based classifiers for assigning putative functions to proteins: NB k-grams, which ignores the statistical dependencies among overlapping k-grams; NB(k), which models such dependencies; SVM k-grams, a support vector machine (SVM) classifier trained on an amino acid k-gram composition-based representation of protein sequences; an approach that combines the outputs of the classifiers based on amino acid k-gram composition using a two-stage decision tree classifier (DTree); and an approach that combines function annotation based on sequence similarity (obtained using PSI-BLAST) with that based on amino acid k-gram composition using a two-stage decision tree classifier (HDTree).

Results

We report the performance of NB k-grams, NB(k), SVM k-grams, PSI-BLAST, DTree and HDTree classifiers on data sets of three functional families from the Gene Ontology (GO). Each of the proposed methods is effective in correctly assigning GO function categories to protein sequences, even when they share little sequence identity (10% or less) with sequences of known function. The performance of each method is measured in terms of classification accuracy, estimated by cross-validation. Of the four one-stage classifiers, NB k-grams outperformed the others (or tied for best performance) on 11 of the 21 data sets, NB(k) on 5 of the 21, PSI-BLAST on 5 of the 21, and SVM k-grams on 3 of the 21. These results assume that the optimal value of k is known; in practice, the optimal value is not constant across data sets and can only be determined experimentally. This was the motivation for DTree: by combining the outputs of NB k-grams and NB(k) for all values of k, we were able to improve classification performance on 6 of the data sets and obtain more consistent results over all data sets than any classifier with a fixed value of k. These results improve significantly when we add the complementary results of PSI-BLAST. The two-stage decision tree classifier HDTree outperforms DTree, NB k-grams, NB(k), PSI-BLAST and SVM k-grams on a vast majority (18 out of 21) of the data sets; this ratio becomes 15 out of 15 data sets when the sequence identity is in the range of 10% to 90%. We also show how the likelihood ratios of k-gram occurrences for a specific protein functional class can be used to identify amino acid motifs that are reliable predictors of the corresponding functional class. Motifs identified using this approach were, in several cases, shown to correspond to experimentally determined active sites and other functional motifs.

Conclusions

We have shown that amino acid k-gram compositions of protein sequences offer an inexpensive, yet highly effective, source of information for predicting the GO function annotations of proteins. Our experimental results demonstrate the feasibility of using machine learning approaches that require only the amino acid k-gram compositions to automatically and reliably generate GO function annotations of protein sequences, even in cases where the sequence identity between the query sequence and the data set of annotated proteins (training set) is extremely low. Our results show that by combining amino acid k-gram based protein function classifiers with function annotation based on sequence homology (using PSI-BLAST), we can build very strong classifiers whose accuracies reach well over 90%. Our results also suggest that this approach can be extended to the efficient computational identification of potentially functionally significant sequence motifs, without the need for computationally expensive sequence alignment.

Background

Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters and molecular machines in cells. Experimental determination of protein structure and function significantly lags behind the rate of growth of protein sequence databases, and this situation is likely to continue for the foreseeable future. Hence, assigning putative functions to proteins based on sequence alone remains one of the most challenging problems in functional genomics [1]. Improvements in annotating protein sequences can be expected to yield significant improvements in gene annotations. Against this background, there has been a great deal of interest in the development of automated annotation of protein sequences with Gene Ontology (GO) [2] function labels.

One class of sequence-based approaches for functional annotation relies on the comparison of the sequence in question to other sequences in a database of sequences with known function. Functional assignment is made by transference of function whenever sequences are sufficiently similar. A commonly employed notion of similarity is based on estimated sequence homology using programs such as BLAST and its derivatives [3]. Sequence searches often return multiple hits, so significant human expertise is needed to interpret the results, and the reliability of this approach drops rapidly once the pair-wise sequence identity falls below 30 percent [4]. A second class of sequence-based approaches for assigning putative functions to protein sequences relies on the detection of sequence patterns. (Several automated tools for identifying conserved sequence patterns from a given set of sequences, e.g., e-motif and e-matrix [5, 6] and MEME [7], are available.) Motif databases can be queried using a protein sequence to obtain a list of conserved sequence patterns found in the sequence, as well as functions associated with the respective patterns; the results can then be used to assign putative functions to the protein sequence.

In the case of protein families having sufficient numbers of well-characterized members, data mining approaches rooted in statistical inference and machine learning [8] offer an attractive and cost-effective approach to automated construction of classifiers for assigning putative functions to novel protein sequences.
In essence, the data mining approach uses a representative training data set that encodes information about proteins with known functions to build a classifier for assigning proteins to one of the functional families

represented in the training set (and, if necessary, a default class indicating unknown function). The resulting classifier can then be used to assign novel protein sequences to one of the protein families represented in the training set, after it has been validated using an independent test set (which was not used to build the classifier). Recent work by our group [9, 10] has explored the use of machine learning approaches for the automated construction of such classifiers.

In this paper, we explore machine learning approaches for prediction of GO functional categories of protein sequences, with emphasis on methods that utilize only the readily available amino acid sequences of the target proteins. We are especially interested in the effectiveness of such methods in reliably assigning GO functions to protein sequences that share little sequence identity with sequences that have experimentally-validated GO function labels. We are also interested in methods that can not only predict the GO function categories of target proteins, but also help identify putative functional sites, using only primary sequence information.

We compare four different methods that use class conditional probabilities of k-grams (k-letter subsequences) to represent amino acid sequences. The first method uses a Naive Bayes classifier that treats each amino acid sequence as if it were simply a bag of amino acids. The second method (NB k-grams) applies the Naive Bayes classifier to a bag of k-grams (k > 1). Note that NB k-grams violates the Naive Bayes assumption of independence in an obvious fashion: neighbouring k-grams overlap along the sequence, and adjacent k-grams have k-1 elements in common. The third method overcomes this problem by constructing an undirected graphical probabilistic model for k-grams [11, 12], which explicitly models the dependencies among overlapping k-grams in a sequence. We train one such model per functional family. During classification, just as in the case of the Naive Bayes classifier, the sequence to be classified is assigned to the class that has the largest posterior probability given the sequence. We call the resulting classifier NB(k) to denote the fact that it models dependencies among k adjacent elements of sequences. Note that NB(1) is equivalent to NB 1-grams, which in turn is equivalent to the Naive Bayes (NB) classifier. Our fourth method applies a support vector machine (SVM) [13, 14] to classify amino acid sequences represented using class conditional probability distributions of k-grams in the sequence. SVMs have recently been applied successfully to many problems in computational biology, including protein function classification [15], protein subcellular localization [16,17,18,19,20,21,22], and identification of protein-protein interaction sites from sequences [23]. Methods similar to our SVM k-grams method have been independently developed and applied to the task of predicting subcellular localization of proteins [21,22].

We also compare two two-stage hybrid methods that use the outputs of the previous methods as input to a second-stage classifier. Two-stage classifiers have previously been shown to boost performance [18,21,24]. Both of these methods use a simple decision tree algorithm [25, 26] to build classifiers. The first method uses the output from the NB k-grams and NB(k) classifiers for k values of 2, 3, and 4. The second method uses the same input in addition to the output of a PSI-BLAST classifier. We have shown previously [10] that NB k-grams and NB(k) can produce very reliable classifiers.
We showed that neither NB k-grams nor NB(k) consistently outperformed the other. Also, there was no fixed value of k that was optimal for all data sets. By combining the outputs of these methods, we are combining overlapping but complementary information, which increases the flexibility and predictability of our algorithm. PSI-BLAST adds additional complementary information.

The other methods are based on the k-gram composition of proteins; PSI-BLAST is a homology-based tool that uses sequence alignment and does well when proteins have high sequence similarity. The goal of our new methods is to combine the advantages of tools based on a range of k-gram compositions with tools based on sequence homology.

NB, NB(k) and NB k-grams classifiers have the advantage of training with only one pass through the training data and hence lend themselves to incremental updates as new training data become available. SVMs, on the other hand, often achieve higher classification accuracy than is achievable using algorithms that make a single pass through the training data, by optimizing the classifiers to trade off the complexity of the classifiers against accuracy on the training data (a process referred to as regularization in the machine learning literature). The increased accuracy comes at the expense of increased computational requirements: on a large data set, training an SVM classifier typically takes orders of magnitude more time than NB(k) and NB k-grams, ruling out its use in cases where it is necessary to update the classifiers frequently as new training data become available. Hence, it is of interest to compare the performance of SVM k-grams classifiers with computationally less expensive alternatives, such as NB k-grams and NB(k), that lend themselves to incremental updates as new training data become available.

We compare Naive Bayes, NB k-grams, NB(k) and SVM k-grams classifiers with our two-stage classifiers for assigning protein sequences to the corresponding GO (Gene Ontology [2]) functional families. The sequence data sets used in our experiments were extracted from SwissProt [27]. In our experiments, when comparing classifiers using the same value of k, the NB k-grams classifier outperformed (in terms of classification accuracy) the standard Naive Bayes classifier by a large margin (20%-40%), the NB(k) classifier outperformed the NB k-grams classifier by a few percentage points, and SVM k-grams outperformed NB(k) in a large majority of the test cases. It is worth noting that, among the classifiers that performed the best regardless of the value of k, NB k-grams or NB(k) outperformed the other methods on 13 of the 21 test cases, and on 10 of the 15 test cases with sequence similarity ranging from 10% to 90%. Classification performance could be further improved by using the two-stage approach: the overall accuracy improved on 6 of the 21 data sets when DTree was used and on 18 of the 21 data sets when HDTree was used. In some cases the improvement was over 10%.

Results and Evaluation

Data Sets

Data set 1 (Kinase data set) was derived from families of yeast and human kinases. These families were chosen for this study because many of them are well-characterized, with known structures and functions. The data set consists of 288 proteins belonging to the Gene Ontology functional family Protein Kinase Activity. We classified them according to three GO groups just below it in the hierarchy: protein serine/threonine kinase activity (209 proteins), protein-tyrosine kinase activity (69 proteins), and protein threonine/tyrosine kinase activity (10 proteins). Because GO is represented as a directed acyclic graph, some proteins may be represented in multiple classifications.

Data set 2 (Kinase/Ligase data set) is derived from two subfamilies of Catalytic Activity. This division is at a higher level of the GO hierarchy and consists of 376 proteins belonging to two functional families: Protein Kinase Activity (158 proteins) and GO001684, Protein Ligase Activity (218 proteins).

Data set 3 (Kinase/Ligase/Helicase/Isomerase data set) is a superset of the second data set. It contains the kinases and ligases as well as members of two additional subfamilies of Catalytic Activity: Protein Helicase Activity (110 proteins) and Protein Isomerase Activity (86 proteins). This data set enables us to evaluate classifier performance on a larger number of protein classes and at a high level of the GO hierarchy. It includes a total of 572 proteins.

Each data set was filtered to remove proteins that had multiple GO class labels, to ensure that the classes are non-overlapping - a requirement for all of the methods considered in this paper and for most standard machine learning and statistical methods for classification. After the functional classes were extracted from GO, the data sets used in this study were obtained by extracting the corresponding sequences from SwissProt [27].

To examine the effect of sequence identity on the performance of classifiers, we created seven subsets from each of the three larger "functional class" data sets by clustering the sequences in each data set according to percentage sequence similarity among proteins using BLASTCLUST [28]. For example, for an identity cut-off of 50%, protein A belongs to a cluster if and only if there exists a protein B in the same cluster such that proteins A and B have at least 50% sequence identity over 90% of the sequence length; any two proteins not in the same cluster have an identity score of less than 50%. Using this method, six subsets of the original data sets were created using identity scores of 100%, 90%, 70%, 50%, 30%, and 10%. For testing classifiers, one sequence from each cluster was chosen at random as a representative of that cluster and all other proteins in that cluster were removed from the data set. This procedure ensures that no test sequence has a sequence identity greater than the designated cut-off with any other sequence (of the same GO functional class) in the training set.

The seventh subset of each original data set was created with an even more stringent identity criterion: for each data set, we ran PSI-BLAST [28] with the data set against itself. We chose a sequence at random and removed any other sequence belonging to the same class that had a PSI-BLAST hit at an e-value cutoff of 0.0001. This procedure was repeated on the remaining sequences, in each case eliminating from the data set all sequences that yielded a PSI-BLAST hit at that cutoff, and terminating the process when no more sequences could be eliminated. We call the resulting data set unblastable, because PSI-BLASTing any sequence in the subset against any other sequence (of the same class) in the subset is guaranteed not to return a hit. Thus, by clustering the sequences within each of the three larger functional data sets into seven subsets, we generated a total of twenty-one data sets for testing the effect of sequence identity on the performance of classifiers.
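The unblastable filtering step is easy to express procedurally. The sketch below is our minimal Python illustration of the iterative elimination described above; `psiblast_hits` is a hypothetical wrapper around an external PSI-BLAST run (not a real API of the BLAST distribution) that returns the database sequences reported as hits at the given e-value cutoff.

```python
import random

def unblastable_subset(labeled_seqs, evalue_cutoff=1e-4):
    """Keep a random representative, drop same-class sequences it can hit.

    labeled_seqs: list of (sequence, go_class) pairs.
    psiblast_hits(query, database, cutoff) is a hypothetical wrapper
    around an external PSI-BLAST run; it returns the database sequences
    reported as hits at the given e-value cutoff.
    """
    pool = list(labeled_seqs)
    kept = []
    while pool:
        # Choose a sequence at random and keep it.
        seq, cls = pool.pop(random.randrange(len(pool)))
        kept.append((seq, cls))
        # Eliminate every remaining same-class sequence that PSI-BLAST
        # links to the chosen sequence; the loop repeats until no more
        # sequences can be eliminated (i.e., the pool is exhausted).
        same_class_db = [s for s, c in pool if c == cls]
        hits = set(psiblast_hits(seq, same_class_db, evalue_cutoff))
        pool = [(s, c) for s, c in pool if not (c == cls and s in hits)]
    return kept
```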

Experiments

The computational experiments were motivated by the following questions:

a) How do the NB k-grams, NB(k), and SVM k-grams models compare with each other and against the baseline represented by the Naïve Bayes (NB) classifier?
b) How do they compare with PSI-BLAST on the same data sets?
c) Can using a two-stage classifier improve performance?
d) Will building a two-stage classifier based on NB k-grams, NB(k), and PSI-BLAST results further improve performance?
e) What is the effect of k (which can be viewed as a measure of the complexity of the models in question) on classification accuracy?
f) What is the effect of sequence identity (within the data set used to train and test classifiers) on classifier performance?
g) Can this approach provide a computationally efficient method for identifying possible functional motifs?

NB k-grams and NB(k) models were constructed and evaluated on the twenty-one data sets for choices of k from 1 to 4. Values of k larger than 4 were not considered because, at higher values of k, there is insufficient data to obtain reliable probability estimates. The SVM k-grams model, using a linear SVM kernel, was tested with values of k from 1 to 3; higher values of k were not explored because of computational and memory requirements. The reported accuracy estimates are based on stratified 10-fold cross-validation. Within the 10-fold cross-validation experiments, most individual standard deviations for classifiers were under 1% and never exceeded 2%; thus there was little variability among the individual classifiers generated from different folds of the cross-validation experiments. Tables 1, 2 and 3 show comparisons of results obtained with Naïve Bayes, NB k-grams, NB(k), SVM k-grams, PSI-BLAST, and the two-stage classifiers in predicting membership in each of the three GO functional families used in this study.

Results

Classification performance results obtained using the Kinase data set (Data set 1) are shown in Table 1. When no sequence identity cut-off was used (or, equivalently, the identity cut-off was set at 100%), we obtained 63% classification accuracy using PSI-BLAST, 66% accuracy using Naive Bayes alone, and 83% using SVM 1-grams. Increasing the value of k to 2 increased accuracy to 82% for NB 2-grams, 89% for NB(2), and 91% for SVM 2-grams. For NB(2), this represents a 23% improvement over Naive Bayes and a 7% improvement over NB 2-grams, and SVM 2-grams outperformed NB(2) by less than 3%. NB 3-grams and NB(3) had accuracies of 89.9% and 92.0% respectively, with NB(3) (accuracy of 92.0%) outperforming SVM 3-grams (accuracy of 89.6%). All three methods outperformed PSI-BLAST (accuracy of 89.3%). Increasing k to 4 yielded little overall improvement in classification accuracy on this data set: NB 4-grams improved by 3%, and NB(4) had lower accuracy relative to NB(3). This can be explained by the fact that, as k increases, the probability estimates become less and less reliable (as we run out of data).

The classification accuracies for the Kinase data set over a wide range of sequence identity cut-offs are also shown in Table 1.

The NB k-grams, NB(k), and SVM k-grams methods performed well over a wide range of sequence identities. For example, when k=3, NB 3-grams had accuracies ranging from 90% to 87% (a drop of only 3%) when the sequence identity cut-off was reduced from 90% to 10%. Thus, the classifier is able to reliably assign function to query proteins that have very little sequence identity with annotated proteins in the training set. Similarly, NB(3) suffered only a 2.6% drop in accuracy when sequence identity fell from 90% to 10%, and, on the same data sets, accuracy for SVM 3-grams decreased by 7.2%. When sequence identity was less than 90%, the NB k-grams classifier outperformed the SVM k-grams classifier on all 6 subsets of the Kinase data set, and NB(k) outperformed the SVM k-grams classifier on 5 of the 6 subsets (with the SVM outperforming the NB(k) classifier by 0.1% in classification accuracy on the Kinase data set with a sequence identity cut-off of 90%). It is especially worth noting that the classifiers were effective on the most challenging data set that we examined: the unblastable data set, on which a query sequence has no PSI-BLAST hits in the data (making transfer of annotation based on sequence identity using PSI-BLAST impossible). In contrast, NB 3-grams and NB(3) achieved accuracies of 91.9% each, and SVM 2-grams (the best performer among the SVM k-grams classifiers) had an accuracy of 88.7% on the same data set.

Tables 2 and 3 show that NB(k) or NB k-grams classifiers outperform SVM k-grams classifiers on the Kinase/Ligase (Table 2) and Kinase/Ligase/Isomerase/Helicase (Table 3) data sets at sequence identity cut-offs ranging from 10% to 90%. However, when the sequence identity cut-off is set to 100%, SVM k-grams significantly outperforms NB(k), yielding 100% accuracy on the Kinase/Ligase data set and 92.8% accuracy on the Kinase/Ligase/Isomerase/Helicase data set. This represents an improvement of 8.6% and 10.2%, respectively, over the best accuracy of NB(k) on each of the data sets. It is also worth noting that on the unblastable subset of the Kinase/Ligase data set, the SVM 2-grams classifier, with an accuracy of 98.9%, significantly outperforms the NB, NB(k) and NB k-grams classifiers (whose accuracies range from 74.4% to 76.6%).

As expected, PSI-BLAST had relatively high classification accuracy on data sets corresponding to a sequence identity cut-off of 100%. Interestingly, PSI-BLAST outperforms NB k-grams and NB(k) on the Kinase/Ligase (Table 2) and Kinase/Ligase/Isomerase/Helicase (Table 3) data sets, and SVM k-grams on the Kinase/Ligase/Isomerase/Helicase (Table 3) data set. In the case of the data sets on which PSI-BLAST outperforms the other classifiers, almost one third of the sequences (111 out of 376 for the Kinase/Ligase data set and 191 out of 572 for Kinase/Ligase/Isomerase/Helicase) are nearly identical (with sequence identity greater than 90%). However, in the case of the Kinase data set (Table 1) with a sequence identity cut-off of 100%, the accuracy of PSI-BLAST (62.7%) is substantially worse than that of the NB(k) (93% with k=4), NB k-grams (92% with k=2) and SVM k-grams (91.3% with k=3) classifiers. This is not surprising in light of the fact that only 7 out of 288 sequences in the Kinase data set have sequence identity over 90%. However, the classification accuracy obtained by PSI-BLAST decreases dramatically as the sequence identity between the query protein and the annotated sequences in the training set decreases.
In the case of the Kinase/Ligase data set (Table 2), classification accuracy for PSI-BLAST decreased to 82.3% when the sequence identity cut-off was set to 90%. The accuracies of NB k-grams, NB(k), and SVM k-grams also decreased with decreasing sequence identity, but these methods were still able to achieve a classification accuracy of around 89.0% on the Kinase/Ligase data set.

It is especially worth noting that the NB k-grams and NB(k) classifiers outperformed the other classifiers on 11 out of 12 data sets for the Kinase and Kinase/Ligase data (see Tables 1-2) with sequence identity less than 90%, with SVM k-grams beating the others on the remaining data set. It is also worth noting that, in the case of NB k-grams and NB(k), no single choice of k consistently outperforms all other choices of k. This raises the question as to whether it might be possible to further improve the results of such classifiers by using them in combination. To address this question, we constructed a decision tree classifier (DTree) that takes as input the outputs of the NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), and NB(4) classifiers. Two decision trees built on the individual families of classifiers were also explored: one built on only the output of the NB k-grams family of classifiers and another built on only the output of the NB(k) family of classifiers. The accuracies of these decision trees were 1-5% lower than DTree's, so we do not report them in our results. On 6 of the 21 data sets the resulting decision tree classifier outperformed the best overall one-stage classifier from the set (see Tables 1-3). On 17 of the 21 data sets, the decision tree's accuracy was above or within 1% of the best performing classifier from the set, and on 21 of the 21 data sets DTree's accuracy was higher than or within 3% of the best classifier from the set. Thus, our results suggest that it is beneficial to combine our classifiers in a two-stage approach; the result is a much more consistent classifier.

Recall that on the 7 data sets extracted from the Kinase/Ligase/Helicase/Isomerase data, transfer of function annotation based on sequence homology turned out to be the most accurate method for assigning functions to proteins: protein function assignments based on transfer of annotation from the top PSI-BLAST hits on the training data were more accurate than the function assignments produced by the other (amino acid k-gram composition based) methods on 6 of the 7 data sets. In light of these experimental results, it is natural to ask whether there is some benefit to be gained by combining a sequence homology based tool such as PSI-BLAST with classifiers trained on amino acid k-gram representations of protein sequences, such as NB(k) or SVM k-grams. Recall also that the DTree classifier was able to improve on the individual results on some of the data sets and to stay within 3% of the best individual classifier from the family of classifiers trained on amino acid k-gram representations on all of the data sets. To answer this question, we constructed a decision tree classifier (HDTree) that takes as input, in addition to the inputs used by DTree, the function assignment based on sequence homology (obtained by running PSI-BLAST on the training set). Our experiments show that HDTree outperforms DTree: classifiers generated by HDTree outperform all the other methods on 18 of the 21 data sets. In the case of data sets with sequence identity cut-offs ranging from 10% to 90%, HDTree's overall accuracy was superior to that of the other methods on 15 of the 15 data sets (100%) (see Tables 1-3). On a majority of the data sets this improvement was also significant.
On the Kinase data sets, the improvement in accuracy ranged from 2.1% (in the case of the 100% sequence identity cut-off) to 4.2% (when the sequence identity cut-off was set to 10%) over the NB k-grams and NB(k) classifiers, and from 5.8% to 8.4% over the PSI-BLAST results. In the case of the Kinase/Ligase data set, the improvement in accuracy was over 8% for each of the data sets with sequence identity cut-offs ranging from 10% to

90% relative to the NB k-grams and NB(k) classifiers, and over 15% relative to the PSI-BLAST results. On the Kinase/Ligase/Isomerase/Helicase data sets, the improvement in accuracy ranged from 13% to 17% over the NB k-grams and NB(k) classifiers, and from 1% (100% sequence identity cut-off) to 12% (30% sequence identity cut-off) over the PSI-BLAST results. In summary, the HDTree classifier, which uses both amino acid k-gram composition and sequence homology to assign putative functions to proteins, had the best overall classification accuracy of all the methods.

Discussion

There has been some previous work using k-gram composition for protein sequence classification, including sequence-based assignment of putative functions to proteins. Most of the focus has been on using amino acid composition (1-grams) [18,29,30,31,32], while other work has focused on using dipeptide composition (2-grams) [18,22,33,34,35] to predict protein subcellular localization. An SVM with a spectrum kernel to handle k-grams with k>2 has been reported [36], and k-grams have been used with a Naïve Bayes model for text classification problems [12,37]. Recently, a discriminatively trained version of the NB(k) classifier for sequence classification has been proposed [38]. Methods similar to our SVM k-grams have been independently developed and applied to the task of predicting subcellular localization of proteins [20,21]. In contrast, here we focus on prediction of GO functional labels.

In the data sets used in this study, the class labels are mutually exclusive. However, many proteins are multi-functional. The development of effective methods for classification of data that are labelled with multiple, not necessarily mutually exclusive, class labels or hierarchically structured class labels is largely an open problem in machine learning, although some methods have recently been proposed [39,40,41]. Against this background, it would be interesting to extend the approaches explored in this paper to deal with hierarchically structured class labels. Several authors have recently explored the use of protein-protein interaction data [42, 43], gene expression data [44], and protein structural features [45] to develop methods for assigning putative GO function labels to proteins of unknown function. Against this background, systematic assessment of the utility of different types of information (relative to their cost) for automated GO function annotation of proteins represents an important future direction.

Implications for Automated Sequence-Based GO Function Annotation

Our results confirm the usefulness of classifiers that use a class conditional probabilistic representation of amino acid sequences to predict GO functional families. It can also be useful to use PSI-BLAST for transfer of functional annotation to a query sequence based on sequence identity to proteins with known annotations, when the level of sequence identity between the two is rather high. However, the accuracy of function annotations produced by PSI-BLAST can drop rapidly with decreasing sequence identity between the query sequence and the training set.

In contrast, machine learning methods that utilize amino acid k-gram compositions provide accurate functional annotations whenever proteins share a similar k-gram composition. NB k-grams and NB(k) outperform Naive Bayes in our experiments. In terms of accuracy, NB(k) and NB k-grams are very complementary: there are many instances in our test cases where each outperformed the other. Both of these machine learning methods consistently outperform PSI-BLAST on 2 of the 3 GO functional families, but PSI-BLAST significantly outperformed the two machine learning algorithms on the remaining GO functional family. By combining the results of NB(k), NB k-grams, and PSI-BLAST in a two-stage hybrid approach, we are able to take advantage of detecting high sequence similarity and similar k-gram composition simultaneously in one unified classifier.

NB k-grams and NB(k) require only one pass through the data, which makes the resulting classifiers easy to construct and to update as new data become available. In contrast, at present, there are no efficient algorithms for updating SVM classifiers to incorporate new data in an incremental fashion. This makes NB(k) an attractive alternative when using large data sets or data sets that are rapidly being updated or modified. PSI-BLAST can also be used incrementally: after an initial index has been built, a sequence can be queried against the index, storing the top-scoring hit and its e-value. As new data appear, the sequence is queried against the new data and the result is compared to the previously stored top-scoring e-value; if the new hit has a smaller e-value, it replaces the stored top-scoring hit, and otherwise it is discarded. Finally, the second-stage classifier is based on a simple decision tree algorithm [25, 26]. The input to this classifier is only seven attributes, and the resulting classifier can be built in seconds on a standard 32-bit machine. Therefore, our method is an incremental process that requires minimal computational effort, yet works very well for automated GO function annotation of protein sequences.

Detecting Potential Functionally Significant Motifs from the Learned Classifiers

In addition to predicting functional labels, the likelihood ratios based on the k-gram probabilities given a specific class can be used to identify specific motifs in protein sequences that may be significant for function. In several cases, we have noted that specific residues with top-ranking likelihood ratios correspond to positions in active sites or other functional motifs that have been previously identified by biochemical and genetic approaches. For example, in the kinases, analysis of the likelihood ratios produced by the learned classifiers allowed us to identify the active site motif, HRDL, along with the functional motifs APE and DFG [46,47]. A fourth motif identified by the learned classifier, DIWSL and DVWSL, has also been experimentally determined to be a functional motif for kinases [48]. This fourth motif is located in close proximity to the three verified functional motifs within the folded protein structure. An example is shown in Figure 1a, in which all four motifs are mapped onto the three-dimensional structure of a representative kinase, the lymphocyte-specific kinase Lck [PDB: 1QPC] [49]. These regions were experimentally determined to be important in Lck kinase function [46,47,48].
In all three of our examples, the DIWSL region was within close enough proximity to form contact regions with the active site HRDL and the functional motif APE.

In fact, the DIWSL motif forms several contacts (i.e., several amino acids in each motif have Cα carbons within 4 Å of each other) with the HRDL and APE motifs. These motifs are mapped onto the structures of two other protein kinases, human cyclin-dependent kinase 2 [PDB: 1B38] and human serine/threonine kinase Pak1 [PDB: 1F3M], in Figures 1b and 1c, showing that this relationship is conserved in other kinase family members, as expected. An important advantage of this potential method for identifying functional protein sequence motifs is its lack of reliance on computationally expensive multiple sequence alignment. Additional studies are needed to evaluate the broader applicability of this proposed method for rapid sequence-based identification of functionally or structurally significant motifs in proteins.

Future Directions

Some directions for future work include:

a) Further evaluation of the methods described here on a broader range of data sets;
b) Direct comparison of the performance of the sequence-based methods described here with methods that utilize structural information for query proteins (e.g., on cases drawn from structural genomics targets);
c) Development of principled approaches to assigning a protein sequence simultaneously to multiple classes (in the case of multifunctional proteins);
d) Assessment of the relative utility of other sources of information (e.g., expression data, interaction data, structural features) [42,43,44,45] for improving the accuracy of automated function annotation;
e) Examination of the resulting classifiers to identify testable hypotheses concerning sequence correlates of protein function and to guide the design of experiments to validate such hypotheses.

Conclusions

The results presented in this paper show that amino acid k-gram compositions of sequences offer an inexpensive, yet highly effective, source of information for GO function annotation of protein sequences. Our results demonstrate the feasibility of developing fully automated and computationally efficient sequence-based approaches to functional annotation of proteins, even when they share very little sequence identity with previously annotated sequences. According to our results, this information is complementary to sequence homology and can be combined with PSI-BLAST results to yield a flexible and powerful classifier that works well on a variety of data. Our results also suggest the possibility of identifying potentially functionally significant sequence motifs without performing computationally expensive sequence alignment.

Methods

Classification Using a Probabilistic Model

Before outlining the two probabilistic models used for modelling the interactions among k consecutive elements in a sequence, we define a method for building a classifier associated with a probabilistic model. Suppose we have a probabilistic model $\alpha$ for sequences defined over some alphabet (which in our case is the 20-letter amino acid alphabet). The model $\alpha$ specifies, for any sequence $S = s_1, \ldots, s_n$, the probability $P_\alpha(S = s_1, \ldots, s_n)$. A classifier can be built from the probabilistic model using the following procedure: for each class $c_j$, train a probabilistic model $\alpha(c_j)$ using the sequences belonging to $c_j$, and predict the classification $c(S)$ of a novel sequence $S = s_1, \ldots, s_n$ as:

$$c(S) = \arg\max_{c_j \in C} P_{\alpha(c_j)}(S = s_1, \ldots, s_n) \, P(c_j)$$

Note that $P_\alpha(S = s_1, \ldots, s_n \mid c_j) = P_{\alpha(c_j)}(S = s_1, \ldots, s_n)$; therefore:

$$c(S) = \arg\max_{c_j \in C} P_\alpha(S = s_1, \ldots, s_n \mid c_j) \, P(c_j)$$

Naïve Bayes Classifier

The Naïve Bayes classifier assumes that each element of the sequence is independent of the other elements given the class label. Consequently,

$$c(S) = \arg\max_{c_j \in C} \left[ \prod_{i=1}^{n} P_\alpha(s_i \mid c_j) \right] P(c_j)$$

Note that the Naive Bayes classifier for sequences treats each sequence as though it were simply a bag of letters. We now consider two Naive Bayes-like models based on k-grams.

Naïve Bayes k-grams Classifier

The Naive Bayes k-grams (NB k-grams) method uses a sliding window of size k along each sequence to generate a bag-of-k-grams representation of the sequence. Much as in the case of the Naive Bayes classifier described above, it treats each k-gram in the bag as independent of the others given the class label for the sequence. Given this probabilistic model, the previously outlined method for classification can be applied. The classification rule associated with Naïve Bayes k-grams is:

$$c(S = [S_1 = s_1, \ldots, S_n = s_n]) = \arg\max_{c_j \in C} \left[ \prod_{i=1}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-1} = s_{i+k-1} \mid c_j) \right] P(c_j)$$

A problem with the NB k-grams approach is that successive k-grams extracted from a sequence share k-1 elements in common. This grossly and systematically violates the independence assumption of Naive Bayes.
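As a concrete illustration of the bag-of-k-grams model, the following minimal Python sketch implements an NB k-grams classifier with Laplace-smoothed k-gram probabilities. It is our illustration under the definitions above, not the authors' original implementation (which was written in Java); all class and function names are ours.

```python
from collections import Counter
from math import log

def kgrams(seq, k):
    # Sliding window of size k: successive k-grams overlap in k-1 positions.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

class NBKgrams:
    """Naive Bayes over a bag of k-grams with Laplace-smoothed estimates."""

    def __init__(self, k=3, alphabet_size=20):
        self.k = k
        self.num_kgrams = alphabet_size ** k  # e.g. 8000 possible trimers
        self.counts = {}   # class -> Counter of k-gram occurrences
        self.totals = {}   # class -> total number of k-grams seen
        self.priors = {}   # class -> P(class)

    def fit(self, seqs, labels):
        for s, c in zip(seqs, labels):
            self.counts.setdefault(c, Counter()).update(kgrams(s, self.k))
        for c, ctr in self.counts.items():
            self.totals[c] = sum(ctr.values())
            self.priors[c] = labels.count(c) / len(labels)

    def log_prob(self, gram, c):
        # Laplace estimator: (count + 1) / (total + |alphabet|^k).
        return log((self.counts[c][gram] + 1) /
                   (self.totals[c] + self.num_kgrams))

    def predict(self, seq):
        # Assign the class with the largest (log) posterior, treating the
        # overlapping k-grams as independent given the class.
        return max(self.counts,
                   key=lambda c: log(self.priors[c]) +
                   sum(self.log_prob(g, c) for g in kgrams(seq, self.k)))
```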

Naïve Bayes (k)

We introduce the Naive Bayes (k), or NB(k), model to explicitly model the dependencies that arise as a consequence of the overlap between successive k-grams in a sequence. Figure 2a shows the dependency model for a sequence of 5 elements. We represent the dependencies in graphical form by drawing edges between the elements that are directly dependent on each other. The graph for pairwise dependencies is illustrated in Figure 2b and the one for 3-way dependencies is depicted in Figure 2c. Using the Junction Tree Theorem for graphical models [50], it can be proved [51] that the correct probability model $\alpha$ that captures the dependencies among overlapping k-grams is given by:

$$P_\alpha(S = [S_1 = s_1, \ldots, S_n = s_n]) = \frac{\prod_{i=1}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-1} = s_{i+k-1})}{\prod_{i=2}^{n-k+1} P_\alpha(S_i = s_i, \ldots, S_{i+k-2} = s_{i+k-2})}$$

Given this probabilistic model, we can use the standard approach to classification described above. It is easily seen that when k = 1, Naive Bayes 1-grams as well as Naive Bayes (1) reduce to the Naive Bayes model. The relevant probabilities required for specifying the above models can be estimated using standard techniques for the estimation of probabilities with Laplace estimators [52].

SVM k-grams

Note that the NB(k) algorithm was developed because NB k-grams systematically violates the independence assumption of Naïve Bayes. Against this background, it is of interest to consider other methods that can utilize k-gram frequencies without relying on the independence assumptions made by NB k-grams and without the need for explicit modelling of dependencies as in the case of NB(k). Hence, we consider a Support Vector Machine (SVM) classifier [13,14] that accepts as input a k-gram probability distribution for a protein and outputs a class label. For our experiments we used the SMO algorithm implemented in Weka [26].

PSI-BLAST

As an additional benchmark to test the performance of our methods, we used PSI-BLAST (version 2.2.9) [28]. PSI-BLAST compares an amino acid query sequence against a protein sequence database. For a given data set, we chose one sequence to use as a test sequence; the remaining sequences in the data set were used as a training database. Using PSI-BLAST, we blasted the test sequence against the training database. If the top hit (the sequence with the lowest e-value) in the PSI-BLAST results has the same class as the test query sequence, the query sequence is scored as a true classification. Otherwise, if the top hit has a different class or no hit is reported at all, the query sequence is scored as a false classification. This is repeated for all sequences in the given data set. An e-value of 0.0001 was used for PSI-BLAST, with all other parameters set to their default values.
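The PSI-BLAST benchmark just described amounts to a leave-one-out evaluation. Below is our minimal Python sketch of that loop; `top_psiblast_hit` is a hypothetical wrapper around an external PSI-BLAST run (not part of the BLAST distribution's API) that returns the index of the lowest e-value hit in the database, or None when no hit is reported.

```python
def psiblast_accuracy(labeled_seqs, evalue_cutoff=1e-4):
    # labeled_seqs: list of (sequence, go_class) pairs.
    correct = 0
    for i, (query, cls) in enumerate(labeled_seqs):
        # Hold the query out; the rest of the data set is the database.
        database = [s for j, (s, _) in enumerate(labeled_seqs) if j != i]
        db_labels = [c for j, (_, c) in enumerate(labeled_seqs) if j != i]
        top = top_psiblast_hit(query, database, evalue_cutoff)
        # No hit at all, or a top hit from a different class, is a miss.
        if top is not None and db_labels[top] == cls:
            correct += 1
    return correct / len(labeled_seqs)
```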

DTree Method

The DTree approach uses the outputs of our NB k-grams and NB(k) algorithms as the data representation. Each of these algorithms outputs a discrete value mapping back to the class list: if there are four classes, the output belongs to {0,1,2,3}, where each value corresponds to a class. Since there are six classifiers (NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), and NB(4)), the data representation is simply a 6-dimensional vector of the 6 outputs of these classifiers. This 6-dimensional vector is then used as input to a decision tree algorithm. For these experiments we used the commonly used decision tree algorithm C4.5 [25], implemented as the J4.8 algorithm in Weka [26].

HDTree Method

The HDTree approach uses the outputs of our NB k-grams and NB(k) algorithms, together with the output of the PSI-BLAST classifier, as the data representation. Each of these algorithms outputs a discrete value mapping back to the class list, as above. Since there are seven classifiers (NB 2-grams, NB 3-grams, NB 4-grams, NB(2), NB(3), NB(4), and PSI-BLAST), the data representation is a 7-dimensional vector of the 7 outputs of these classifiers. This 7-dimensional vector is then used as input to a decision tree algorithm, again C4.5 [25] as implemented in the J4.8 algorithm in Weka [26]. A sketch of this construction is given below.
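The following minimal Python sketch illustrates the two-stage construction; scikit-learn's DecisionTreeClassifier stands in here for Weka's J4.8, and all names are ours. It assumes each first-stage classifier exposes a predict(sequence) method returning an integer class index, as described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka J4.8 (C4.5)

def stacked_features(seqs, base_classifiers, psiblast_predict=None):
    # Each first-stage classifier maps a sequence to an integer class index;
    # the second stage sees only this vector of first-stage predictions.
    rows = []
    for s in seqs:
        row = [clf.predict(s) for clf in base_classifiers]   # 6 NB outputs
        if psiblast_predict is not None:
            row.append(psiblast_predict(s))                  # 7th attribute (HDTree)
        rows.append(row)
    return np.array(rows)

# Training sketch: nb_models would hold the fitted NB 2/3/4-gram and
# NB(2)/NB(3)/NB(4) classifiers (hypothetical names on our part).
# X = stacked_features(train_seqs, nb_models, psiblast_predict)  # HDTree input
# second_stage = DecisionTreeClassifier().fit(X, train_labels)
# y_hat = second_stage.predict(
#     stacked_features(test_seqs, nb_models, psiblast_predict))
```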

Motif Detection

We hypothesize that the likelihood ratios based on the k-gram probabilities given a specific class can be used to identify specific motifs in sequences that may be important for protein function. Based on this, we propose the following procedure.

First, only k-grams consisting of amino acids that are not independent given the class are identified, as follows. If

$$P(k\text{-gram} \mid \text{class}) = \prod_{i=1}^{k} P(1\text{-gram}_i \mid \text{class})$$

then the individual amino acids are independent given the class. Because we are interested in k-grams consisting of amino acids that are not independent given the class, we can perform the test given by

$$\frac{P(k\text{-gram} \mid \text{class})}{\prod_{i=1}^{k} P(1\text{-gram}_i \mid \text{class})} \ge \varphi_1$$

When this ratio is close to 1, the amino acids are (approximately) independent given the class; when it is greater than 1, the amino acids are dependent given the class. For example, the following test can be used to select the 3-grams (trimers) of interest for the class kinase:

$$\frac{P(trimer_{ijk} \mid kinase)}{P(monomer_i \mid kinase)\, P(monomer_j \mid kinase)\, P(monomer_k \mid kinase)} \ge \varphi_1$$

where $trimer_{ijk}$ is the trimer defined by the i-th amino acid (of the possible 20 amino acids) in position 1 of the trimer, the j-th amino acid in position 2, and the k-th amino acid in position 3. The trimer belongs to one of the 8000 possible 3-gram combinations of the 20-letter amino acid alphabet, and $\varphi_1$ is a cut-off value. For our study we empirically determined the most useful value of $\varphi_1$ to be 3.5.

Among the k-grams selected using the test described above, we are interested in k-grams that occur more often in a given class relative to the entire data set. The likelihood ratio test is:

$$\frac{P(k\text{-gram} \mid \text{class})}{P(k\text{-gram} \mid \text{observed})} \ge \varphi_2$$

Thus, we can identify the k-gram motifs associated with the kinase data set (for k=3) using the test:

$$\frac{P(trimer_{ijk} \mid kinase)}{P(trimer_{ijk} \mid SwissProt)} \ge \varphi_2$$

where $trimer_{ijk}$ is defined as above. We calculated the observed probabilities using counts from all the protein sequences found in SwissProt (over 170,000 sequences). The greater the value of $\varphi_2$, the more likely the k-gram is to occur in the given class relative to SwissProt as a whole. For this study, we empirically determined the most useful value of $\varphi_2$ to be 3.5.

To determine whether k-gram regions were within close proximity of each other, we used the graphical contacts tool provided by the Diamond Sting Millennium software package [53].

Authors' contributions

CA conceived of and designed the study, carried out the data analysis and visualization, developed the Java computer code, and drafted the manuscript. AS contributed to algorithm development. DD and VH contributed to the design of the study, the analysis and interpretation of results, and the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by grants from the National Science Foundation ( ) and the National Institutes of Health (GM066387) to Vasant Honavar and Drena Dobbs. Carson Andorf has been supported in part by a fellowship funded by an Integrative Graduate Education and Research Training (IGERT) award ( ) from the National Science Foundation. The authors wish to thank members of their research group, especially Oksana Yakhnenko and Cornelia Caragea, for helpful comments on drafts of this paper.

References

1. Eisenberg D, Marcotte E, Xenarios I, and Yeates T. Protein function in the post-genomic era. Nature. 2000, 405(6788).
2. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genet. 2000, (25).
3. Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990, 215.
4. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999, 12(2).
5. Huang J and Brutlag D. The emotif database. Nucleic Acids Res. 2001, 29(1).
6. Ben-Hur A and Brutlag D. Remote homology detection: a motif based approach. Bioinformatics. 2003, 19 Suppl 1.
7. Bailey T, Baker M, Elkan C, and Grundy W. Meme, mast, and meta-meme: New tools for motif discovery in protein sequences. In: Pattern Discovery in Biomolecular Data. Oxford: Oxford University Press, 1999.
8. Baldi P and Brunak S. Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press.
9. Wang X, Schroeder D, Dobbs D, and Honavar V. Automated data-driven discovery of protein function classifiers. Information Sciences.
10. Andorf C, Dobbs D, and Honavar V. Discovering protein function classification rules from reduced alphabet representations of protein sequences. In: Proceedings of the Conference on Computational Biology and Genome Informatics.
11. Charniak E. Statistical Language Learning. Cambridge: MIT Press.
12. Peng F and Schuurmans D. Combining naive Bayes and n-gram language models for text classification. In: Twenty-Fifth European Conference on Information Retrieval Research (ECIR-03).
13. Boser B, Guyon I, and Vapnik V. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA. ACM Press, 1992.
14. Vapnik V. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.
15. Al-Shahib A, Breitling R, Gilbert D. Feature selection and the class imbalance problem in predicting protein function from sequence. Appl Bioinformatics. 2005, 4.
16. Lanckriet G, Cristianini N, Jordan M, and Noble W. Kernel-based integration of genomic data using semidefinite programming. In: Kernel Methods in Computational Biology. Edited by Schoelkopf B, Tsuda K and Vert JP. Cambridge, MA: MIT Press.

17. Sarda D, Chua GH, Li KB, Krishnan A. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics. 2005, 6.
18. Bhasin M, Garg A, Raghava GP. PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics. 2005, 21(10).
19. Nair R, Rost B. Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 2005, 348.
20. Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem. 2005, 280(15).
21. Hua S and Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001, 17.
22. Bhasin M and Raghava G. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res.
23. Yan C, Dobbs D, Honavar V. A two-stage classifier for identification of protein-protein interface residues. Bioinformatics.
24. Atalay V, Cetin-Atalay R. Implicit motif distribution based hybrid computational kernel for sequence classification. Bioinformatics. 2005, 21.
25. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann.
26. Witten I and Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. San Francisco: Morgan Kaufmann.
27. Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin M, Michoud K, O'Donovan C, Phan I, Pilbout S, and Schneider M. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31.
28. Altschul S, Madden T, Schaffer A, Zhang J, Miller W, and Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25(17).
29. Wang M, Yang J, Liu GP, Xu ZJ, Chou KC. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng Des Sel. 17(6).
30. Cai YD, Chou KC. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Commun. 2003, 305(2).
31. Cai YD, Chou KC. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics. 2004, 20.
32. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FS. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003, 31.

33. Raghava GP, Han JH. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics. 2005, 6.
34. Lanckriet G, Cristianini N, Jordan M, and Noble W. Kernel-based integration of genomic data using semidefinite programming. In: Kernel Methods in Computational Biology. Edited by Schoelkopf B, Tsuda K and Vert JP. Cambridge, MA: MIT Press.
35. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19.
36. Leslie C, Eskin E, and Noble W. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In: Proceedings of the Pacific Symposium on Biocomputing 2002, January 2-7.
37. Yee B, Cheng M, Carbonell J, Klein-Seetharaman J. Protein classification based on text document classification techniques. Proteins: Structure, Function, and Bioinformatics. 2005, 58(4).
38. Yakhnenko O, Silvescu A, and Honavar V. Discriminatively Trained Markov Model for Sequence Classification. In: Proceedings of the IEEE Conference on Data Mining (ICDM 2005). IEEE Press. In press.
39. Kriegel HP, Kroeger P, Pryakhin A, and Schubert M. Using Support Vector Machines for Classifying Large Sets of Multi-Represented Objects. In: Proceedings of the 4th SIAM Int. Conf. on Data Mining, 2004.
40. Clare A and King RD. Machine learning of functional class from phenotype data. Bioinformatics. 2002, 18.
41. Wu F, Zhang J, and Honavar V. Learning Classifiers Using Hierarchically Structured Class Taxonomies. In: Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA 2005), Edinburgh. Berlin: Springer-Verlag. In press.
42. Letovsky S and Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics. 2003, 19(Suppl 1).
43. Deng M, Tu Z, Sun F, Chen T. Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics. 2004, 20.
44. Lagreid A, Hvidsten TR, Midelfart H, Komorowski J, and Sandvik AK. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns. Genome Res. 2003, 13(5).
45. Hayete B and Bienkowska JR. GOTrees: predicting GO associations from protein domain composition using decision trees. Pac Symp Biocomput. 2005.
46. Prince T and Matts RL. Definition of Protein Kinase Sequence Motifs That Trigger High Affinity Binding of Hsp90 and Cdc37. J. Biol. Chem. 2004, 279.
47. Li K, Zhao S, Karur V, and Wojchowski DM. DYRK3 Activation, Engagement of Protein Kinase A/cAMP Response Element-binding Protein, and Modulation of Progenitor Cell Survival. J. Biol. Chem. 2002, 277(49).

48. Kung HJ, Chen HC, Robinson D. Molecular Profiling of Tyrosine Kinases in Normal and Cancer Cells. J Biomed Sci. 1998, 5.
49. Zhu X, Kim JL, Rose PE, Stover DR, Toledo LM, Zhao H, Morgenstern KA. Structural Analysis of the Lymphocyte-Specific Kinase Lck in Complex with Non-Selective and Src Family Selective Kinase Inhibitors. Structure (London). 1999, 7.
50. Cowell R, Dawid A, Lauritzen S, and Spiegelhalter D. Probabilistic Networks and Expert Systems. Springer.
51. Silvescu A, Andorf C, Dobbs D, and Honavar V. Inter-element dependency models for sequence classification. Technical report, Department of Computer Science, Iowa State University.
52. Mitchell T. Machine Learning. New York, USA: McGraw Hill.
53. Neshich G, Mancini AL, Yamagishi ME, Kuser PR, Fileto, et al. STING Report: convenient web-based application for graphic and tabular presentations of protein sequence, structure and function descriptors from the STING database. Nucleic Acids Res. 2005, 33, Database Issue: D269-D274.

Figures

Figure 1 - Kinase Protein Structures with Highlighted Functional Motif Candidates

Structures of three proteins, the lymphocyte-specific kinase Lck [PDB: 1QPC], human cyclin-dependent kinase 2 [PDB: 1B38], and human serine/threonine kinase Pak1 [PDB: 1F3M], are shown with four candidate functional motifs (identified by likelihood and independence ratios) highlighted. The functional motif MAPE is labelled 1 (blue); the motif identified by our method, DVWS or DIWSL, is labelled 2 (green); the active site motif HRDL is labelled 3 (red); and the functional motif DFG is labelled 4 (orange). Potential non-covalent bonds among predicted motifs are shown in a contact map in the box below each structure. Using the distances between individual residues within motifs and the geometric relationships among atoms in these residues in the three-dimensional structure, possible bonds (Cα distance less than 4 Å) could be formed between residues in different motifs. Residues are represented by circles labelled with the corresponding amino acid symbol, and possible non-covalent contact bonds are represented by lines between two residues. Possible contacts can be formed between the regions HRDL and DFG; HRDL and MAPE; HRDL and DIWSL; and DIWSL and MAPE. The predicted bonds were determined using the Sting Millennium Package [53].

Figure 2 - Undirected Graphical Models

Graphical depiction of the dependence between the elements in a sequence of five elements using undirected graphical models (for protein data, nodes represent amino acids and edges represent dependencies between amino acids): a) Naïve Bayes, b) pairwise dependence (k=2), and c) 3-way dependence (k=3).
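For readers who want to reproduce the contact criterion used in the Figure 1 legend (Cα-Cα distance under 4 Å) without the STING package, the following minimal numpy sketch applies the same criterion; it assumes the Cα coordinates of the motif residues have already been parsed from the PDB entries, and the data layout is ours.

```python
import numpy as np

def motif_contacts(motif_a, motif_b, cutoff=4.0):
    # motif_a, motif_b: lists of (residue_label, ca_xyz) pairs, where ca_xyz
    # is the (x, y, z) coordinate of the residue's Calpha atom from the PDB file.
    # Returns the residue pairs whose Calpha atoms lie within `cutoff` angstroms.
    pairs = []
    for label_a, xyz_a in motif_a:
        for label_b, xyz_b in motif_b:
            dist = np.linalg.norm(np.asarray(xyz_a) - np.asarray(xyz_b))
            if dist < cutoff:
                pairs.append((label_a, label_b, float(dist)))
    return pairs
```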

Tables

Table 1 - Kinase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements.
[Table values not preserved in this transcription. Columns: percent identity bins plus an UnBLASTable column; rows: Size, Naïve Bayes, NB 1-gram, NB 2-gram, NB 3-gram, NB 4-gram, NB(1), NB(2), NB(3), NB(4), SVM 1-gram, SVM 2-gram, SVM 3-gram, PSI-BLAST (N/A for UnBLASTable), DTree, HDTree.]

Table 2 - Kinase/Ligase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase/Ligase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements.
[Table values not preserved in this transcription; same rows and columns as Table 1.]

Table 3 - Kinase/Ligase/Helicase/Isomerase data set results
Accuracy of classification (estimated by cross-validation) for the Kinase/Ligase/Helicase/Isomerase data set. Note that the Naïve Bayes classifier is defined for k = 1 and therefore has the same model (and the same accuracies) as NB 1-gram and NB(1). Experiments with SVM k-grams for k > 3 are typically infeasible for large data sets because of memory requirements. A sketch of this style of cross-validation estimate follows the list of additional files below.
[Table values not preserved in this transcription; same rows and columns as Table 1.]

Additional files

Additional file 1 - Figure1.jpg
Additional file 2 - Figure2.jpg
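The accuracies in Tables 1-3 are cross-validation estimates. The snippet below is a small sketch of that style of estimate, not our experimental pipeline: it cross-validates a multinomial Naïve Bayes classifier over amino acid 3-gram composition using scikit-learn. The sequences, labels, and fold count are toy values chosen only so the example runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical protein sequences and binary GO class labels.
seqs = ["MAPELKVHRDLDFG", "ACDEFGHIKLMNPQ", "MAPEIDVWSLHRDL", "LMNPQRSTVWYACD"]
labels = [1, 0, 1, 0]

# Represent each sequence by its 3-gram (k-gram, k = 3) composition.
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vec.fit_transform(seqs)

# Accuracy estimated by cross-validation (2 folds only because the toy
# data set is tiny; the tables above use much larger data sets).
acc = cross_val_score(MultinomialNB(), X, labels, cv=2, scoring="accuracy")
print(acc.mean())
```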

Figure 1
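The contact maps accompanying Figure 1 rest on a simple geometric test: residues in different motifs are flagged as possible contacts when their Cα atoms lie within 4 Å. The sketch below shows one way to run that test with Biopython; it is an illustration rather than the procedure used for the figure (the Sting Millennium Package [33] was used there), and the motif residue numbers for Lck, the chain identifier, and the local PDB file path are hypothetical placeholders.

```python
from itertools import product
from Bio.PDB import PDBParser

# Hypothetical residue ranges for two motifs in chain A of Lck [PDB: 1QPC];
# the real positions would come from mapping the motifs onto the structure.
MOTIF_HRDL = range(154, 158)
MOTIF_DFG = range(172, 175)

# Assumes 1qpc.pdb has been downloaded locally from the PDB.
structure = PDBParser(QUIET=True).get_structure("1QPC", "1qpc.pdb")
chain = structure[0]["A"]

# Flag residue pairs whose C-alpha atoms lie within the 4 Angstrom cutoff.
for i, j in product(MOTIF_HRDL, MOTIF_DFG):
    try:
        dist = chain[i]["CA"] - chain[j]["CA"]  # Bio.PDB: '-' gives distance
    except KeyError:  # residue or CA atom missing from the model
        continue
    if dist < 4.0:
        print(f"possible contact: {i} -- {j} ({dist:.2f} A)")
```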
