RF-NR: Random forest based approach for improved classification of Nuclear Receptors


Hamid D. Ismail, Hiroto Saigo, Dukka B KC*

Abstract: The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and function of newly discovered NR proteins. Recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses spectrum-like features, namely Amino Acid Composition, Dipeptide Composition and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria shows that RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the features that are most important for classifying NR subfamilies.

Index Terms - nuclear receptor, protein classification, Random Forest, spectrum-kernel

H.I. is with the Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC. E-mail: hismail@ncat.edu. H.S. is with the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka-shi, Fukuoka, Japan. E-mail: saigo@bio.kyutech.ac.jp. DBKC is with the Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC. E-mail: dbkc@ncat.edu. *: Corresponding Author

I. INTRODUCTION

The Nuclear Receptor (NR) superfamily includes a number of globular proteins that play key roles as transcription factors by regulating the expression of genes involved in glucose and lipid metabolism, immune response, and cell division and differentiation [1]. All NR proteins have a common modular domain organization. A typical nuclear receptor consists of an N-terminal A/B domain, a conserved DNA binding domain (DBD) or region C, a linker region D, and a conserved E region that contains the ligand-binding domain (LBD). The cellular malfunction of NR proteins is implicated in many disease conditions such as high blood pressure, high blood cholesterol, type II diabetes, immune deficiency, and cancer. Therefore, NR proteins have recently become important drug targets [2]. Based on phylogeny, the NR superfamily has been sub-divided into eight subfamilies. Due to the large number of new protein sequences being generated in this era, the identification of NRs and their subfamilies from amino acid sequence information is an important problem in the field of bioinformatics. In this regard, there have been various attempts to develop computational methods to classify NRs. Bhasin and Raghava (2004) [3] developed an SVM-based classification model using amino acid composition and dipeptide composition. Gao et al. [4] reconstructed the dataset, introduced the pseudo amino acid composition (PseAAC) and achieved an overall high accuracy, but the number of NR subfamilies included in the classification was only four rather than eight [5]. Recently, two predictors, NR-2L [6] and iNR-PhysChem [7], were proposed to perform NR classification in two levels.
In the first level, the model determines whether the protein is an NR or not, and in the second level it predicts to which subfamily the protein belongs. These two predictors obtained high prediction accuracy, but they still have some shortcomings: the dataset for the model was derived from an old version of NucleaRDB and no feature selection was applied. Most recently, NRPred-FS [8] has been proposed; it applies a feature selection algorithm to reduce the feature dimensionality and has overall prediction accuracies of 97% and 93% for the first and second levels of NR prediction, respectively. Furthermore, NRfamPred [9] has been proposed, which uses SVM for two-level prediction of NRs and their subfamilies. Despite the progress in prediction accuracy and the inclusion of the eight known NR subfamilies, the performance of these methods is still unsatisfactory. Specifically, the evaluation of NRPred-FS was based on a small independent test sample. Moreover, NRPred-FS [8], which has been shown to be the best method available, failed to achieve reasonably high accuracy for some of the NR subfamilies. Thus, there is still a need for the development of a new classification method that has better prediction accuracy for all the NR subfamilies. On one hand, the spectrum kernel [10] was recently proposed for protein sequence classification.

It basically uses the k-spectrum (for a given number k >= 1) of an input sequence, which is the set of all the k-length (contiguous) subsequences it contains. The spectrum kernel is conceptually simple and efficient to compute, and it performs well in comparison with other homology detection methods. Hence, we utilized spectrum-like features by decomposing the amino acid sequence into Amino Acid Composition (AAC), Dipeptide Composition (DAAC) and Tripeptide Composition (TAAC). On the other hand, Random Forest (RF) [11], a powerful machine learning tool that can handle a large number of features, is gaining a lot of popularity. In this regard, we used the combined spectrum-like features (AAC, DAAC, and TAAC) in an RF framework to accurately classify NRs. We call this NR classification method RF-NR (Random Forest based Nuclear Receptor classifier). On an independent dataset, RF-NR achieved an overall accuracy close to 99.6% with a substantial improvement in the prediction of all subfamilies, reflecting that RF-NR performs better than (or comparably to) all other previous methods. In summary, when a new protein sequence is given as an input, RF-NR can either identify it as a non-NR or classify it into one of the NR subfamilies.

II. MATERIAL AND METHODS

2.1 Datasets

In our present work, we used two sets of data, each comprising a benchmark dataset for cross-validation and an independent dataset. The first one (Benchmark Dataset 1) is the dataset published earlier for the development of NRPred-FS [8], and the second one (Benchmark Dataset 2) is the published dataset for the development of NR-2L [6] and iNR-PhysChem [7], also used for NRfamPred [9]. The main difference between the two datasets is the sequence redundancy criterion used: 40% in Benchmark Dataset 1 and 60% in Benchmark Dataset 2.

Benchmark Dataset 1
Benchmark Dataset 1 has 267 NRs and 1000 non-NRs. The sequences of the known NR subfamilies were obtained from the Nuclear Receptor Database (NucleaRDB) [5]. The dataset is shown in Table 1. The NR classification [12] in NucleaRDB is based on the phylogeny-based classification that organizes the NR proteins into eight subfamilies (see Table 1). Each subfamily includes a number of NR proteins whose member sequences were obtained from different animal species. The initial dataset has 3016 NR sequences. CD-HIT [13], a widely used clustering program, was used to remove redundant sequences that have a similarity of more than 40% for NR1-NR6 and more than 80% for NR0A and NR0B, as in NRPred-FS [8]. A similarity cut-off of 80% was used for NR0A and NR0B because these groups had too few sequences. The final benchmark dataset contained 267 NR proteins belonging to eight subfamilies, as shown in Table 1. The 1000 non-NR sequences were obtained from the authors of NRPred-FS.

TABLE 1
BENCHMARK DATASET 1
(Columns: #, Subfamily, # of sequences in NucleaRDB, # of sequences after CD-HIT.)
1 Thyroid hormone like (NR1)
2 HNF4-like (NR2)
3 Estrogen like (NR3)
4 Nerve GF IB-like (NR4)
5 Fushi tarazu-F1 like (NR5)
6 Germ cell NF like (NR6)
7 Knirps like (NR0A)
8 Dax like (NR0B)
9 Non-NR protein (NNR): 1000

Independent Dataset 1
In order to evaluate the real-life performance of the predictor, there is a need for an independent benchmark dataset. In this regard, we also compiled Independent Dataset 1. This dataset contains all NR sequences except the ones that are in Table 1; it contains 2749 NR sequences belonging to eight subfamilies. This dataset is shown in Columns 1 and 2 of Table 6. The 1064 non-NR protein sequences, which are not in Dataset 1, were randomly collected from the GenBank protein database [14]. These sequences were also subjected to the 40% redundancy reduction procedure using CD-HIT so that no protein had more than 40% sequence identity to any other sequence in the dataset.

TABLE 2
INDEPENDENT DATASET 1
(Columns: #, Subfamily, # of sequences.)
1 Thyroid hormone like (NR1)
2 HNF4-like (NR2)
3 Estrogen like (NR3)
4 Nerve GF IB-like (NR4)
5 Fushi tarazu-F1 like (NR5)
6 Germ cell NF like (NR6): 30
7 Knirps like (NR0A): 18
8 Dax like (NR0B): 27
9 Non-NR proteins (NNR): 1064

Benchmark Dataset 2
In order to compare RF-NR with other existing methods, we also used the earlier published dataset that was used for the development of the NR-2L [6] and iNR-PhysChem [7] methods. This dataset was also generated from NucleaRDB. The dataset has 159 NRs and 500 non-NRs.

The redundancy criterion used in this dataset is 60%.

Independent Dataset 2
We also used an independent dataset that was published earlier in the development of NR-2L [6]. This dataset has 568 NRs and 500 non-NRs.

2.2 Protein Sequence Features

The goal of this step is to transform the sequence data into vectors of numerical values that can be used to learn the underlying model. This is often the most critical step that determines whether the method will ultimately be successful. Various protein descriptors have been proposed for feature extraction [15,16]. Furthermore, the Spectrum Kernel [10] is one of the descriptors suggested for the detection of remote protein homology. It uses the k-spectrum (for a given number k >= 1) of an input sequence, which is the set of all the k-length (contiguous) subsequences it contains. The spectrum kernel is conceptually simple and efficient to compute, and it performs well in comparison with other protein descriptors that are used for homology detection. In this study, we use spectrum-like features that combine discrete protein features including Amino Acid Composition (AAC), Dipeptide Amino Acid Composition (DAAC), and Tripeptide Amino Acid Composition (TAAC). The discrete features (AAC and DAAC) are, in fact, relative frequencies of amino acids at certain settings, a notion associated with probabilities whose interpretation is straightforward in terms of the sequence pattern landscape. We chose this combination of protein descriptors for feature extraction with the hypothesis that it works best for NR classification, which basically relies on the homology detection that contributes significantly to the subfamily prediction of NR sequences. With spectrum-like features, each protein sequence is represented by the following feature vector:

x = (x_AAC, x_DAAC, x_TAAC)   (1)

where x_AAC is the AAC, x_DAAC is the DAAC, and x_TAAC is the TAAC.

Amino Acid Composition (AAC)
Amino acid composition is one of the simplest and most effective features. The AAC of a protein sequence of length N is defined as the count of each of the 20 amino acids in the sequence divided by N:

AAC_i = n_i / N,  i = 1, 2, ..., 20   (2)

where n_i is the number of occurrences of the i-th amino acid.

Dipeptide Amino Acid Composition (DAAC)
Dipeptide amino acid composition is a feature that captures local-order information of a protein sequence. DAAC is defined as

DAAC_i = n(dip_i) / (N - 1),  i = 1, 2, ..., 400   (3)

where dip_i represents any possible dipeptide and n(dip_i) is its count in the sequence. The total number of possible dipeptides is 20^2 = 400.

Tripeptide Composition (TAAC)
Tripeptide amino acid composition (3-mer spectrum) of a sequence is the total count of each possible 3-mer of amino acids in the protein sequence:

TAAC_i = n(trip_i),  i = 1, 2, ..., 8000   (4)

where trip_i represents any possible tripeptide. The total number of 3-mers is 20^3 = 8000. As in the Spectrum Kernel, one could continue adding more discrete features such as the 4-mer spectrum. However, the number of features becomes considerably large as k increases; in the case of 4-mers, the number of features becomes 20^4 = 160,000, which is not favored by machine learning algorithms. Hence, we did not explore 4-mers in our work. As discussed in the results section, the combination of AAC, DAAC and TAAC produces the best results. Hence, in our study we represent each protein sequence using 8420 features: 20 for AAC, 400 for DAAC, and 8000 for TAAC. We use the propy package [17] to obtain these features.
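To illustrate how such spectrum-like features can be computed, the following is a minimal Python sketch of the k-spectrum decomposition. It is not the propy-based pipeline used in the paper; the function names, the toy sequence, and the normalization of DAAC by (N - 1) are our own illustrative assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_vocab(k):
    """All possible k-mers over the 20 standard amino acids, in a fixed order."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]

def kmer_counts(seq, k):
    """Count occurrences of every contiguous k-mer in the sequence (k-spectrum)."""
    counts = dict.fromkeys(kmer_vocab(k), 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing non-standard residues
            counts[kmer] += 1
    return counts

def spectrum_features(seq):
    """AAC (frequencies), DAAC (frequencies) and TAAC (raw counts): 20 + 400 + 8000 = 8420 values."""
    n = len(seq)
    aac = [c / n for c in kmer_counts(seq, 1).values()]
    daac = [c / max(n - 1, 1) for c in kmer_counts(seq, 2).values()]   # normalization assumed
    taac = list(kmer_counts(seq, 3).values())                          # total counts, as in Eq. (4)
    return aac + daac + taac

# Example: feature vector for a short toy sequence
features = spectrum_features("MKVLAAGIVLLGSLLAAG")
print(len(features))  # 8420
```

In the actual study these features were generated with propy [17]; the sketch only shows the underlying k-spectrum idea.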
2.3 Random Forest

Since the number of features is large in this study, the random forest (RF) is the machine learning method of choice, as it is equipped with a mechanism that can deal with the high dimensionality of the data by selecting only the important features. The RF, which was proposed by Breiman [11], is an ensemble of decision trees that are grown using only subsets of the features, and the final classification is obtained by averaging the decisions of all trees in the forest. RF has been applied to many structural bioinformatics problems such as fold recognition [18]. It is known for its lower generalization error (GE), relative robustness to outliers and noise, and its immunity against over-fitting compared to other machine learning methods [18,19]. The randomness of bootstrap sampling enhances the prediction accuracy. Each individual decision tree in the random forest is constructed with a bootstrap sample from the training dataset. It is made up of a root node, internal nodes, and leaves. Each node represents a feature that is selected based on a criterion. A node may have two or more branches, and each branch corresponds to a range of values for that selected feature. The node branching of the decision tree is performed by computing the Gini index for each feature, and only the most important

feature that splits the training data into the purest classes is selected to represent a node, and the ranges of its values that split the sequences are chosen as decision rules. Query sequences are classified by traversing the tree from the root node down to a leaf, where the path is determined according to the outcome of the splitting condition at each node. At each node, we find to which outgoing branch the observed value of the given feature corresponds. Finally, the query sequence is assigned to one of the eight subfamilies or to non-NR. The classification process is illustrated in Fig. 4.

Feature importance and feature selection
In RF, the Gini impurity index for each feature is calculated and considered for the node split. The importance of a feature is estimated as the sum of the Gini index reduction (from parent to child) over all nodes in which that feature is used to split. The feature that contributes the largest Gini index reduction is the most important. Thus, the features can be ranked according to their importance. The overall importance of a feature in the forest is defined as the average of its importance values over all trees in the forest.

RF parameters
For better results, RF requires some parameters to be set by the user. These parameters include the maximum number of features to be considered, the maximum number of nodes in a tree, and the number of trees in the forest. To choose the best values, different values of the parameters were used and the performance was recorded each time; the values that yielded the best performance were then selected.

Class weights
As in most one-versus-rest multi-class problems, RF may suffer when trained on an extremely imbalanced dataset, since it is more probable that a bootstrap sample contains few or even none of the minority class, which results in a tree with poor performance for predicting the minority class. This problem was overcome by assigning a weight to each class, with the minority classes given larger weights. The class weights are used to weight the Gini index in tree induction and are also used in the tree terminal nodes for class prediction. The class weights are adjusted automatically to be inversely proportional to the class frequencies in the input data. The RF is a robust learner and less prone to generalization error and overfitting. The prediction of the sequence subfamily depends on probabilistic averaging over the decision trees rather than voting for a single subfamily: a vector of probabilities corresponding to the subfamilies is produced for each prediction, and a sequence is assigned the most probable subfamily. The RF algorithm used in RF-NR is as follows. Given the training dataset D with size n, the number of trees in the forest (B), the maximum number of random features to select (m), and the maximum number of leaf nodes (N):
1. b <- 1
2. REPEAT
3.   S_b <- sample n instances from D with replacement
4.   Build the decision tree T_b on S_b using at most m random features per split and at most N leaf nodes
5.   b <- b + 1
6. UNTIL b > B
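As a concrete illustration of the training setup described above, the following is a minimal Python sketch using scikit-learn (cited as [19]). Apart from the 200 trees and Gini criterion reported in the paper, the file names, parameter values, and variable names are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: (num_sequences, 8420) spectrum-like feature matrix (AAC + DAAC + TAAC)
# y: class labels, e.g. "NR1".."NR6", "NR0A", "NR0B", "NNR"
X = np.load("features.npy")                   # hypothetical pre-computed feature file
y = np.load("labels.npy", allow_pickle=True)  # hypothetical label file

clf = RandomForestClassifier(
    n_estimators=200,          # number of trees in the forest (B)
    max_features="sqrt",       # random subset of features considered per split (m); an assumption
    criterion="gini",          # node splits chosen by Gini impurity
    class_weight="balanced",   # weights inversely proportional to class frequencies
    oob_score=True,            # out-of-bag estimate as a built-in error check
    random_state=0,
)
clf.fit(X, y)

# Class-probability averaging over the trees; the most probable class is the prediction
probs = clf.predict_proba(X[:1])
print(clf.classes_[np.argmax(probs)], clf.oob_score_)
```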
2.4 Model Validation

The goal of model validation is to assess the models thoroughly for prediction accuracy. In this study, two evaluation strategies were adopted: leave-one-out cross-validation (LOOCV) and independent test samples.

Leave-one-out Cross-validation (LOOCV)
Leave-one-out cross-validation is a model validation technique to assess how the results of a model will generalize to an independent dataset. It is a cross-validation technique in which one observation is left out as the validation set and the remaining observations are used as the training set. In this regard, for our purpose we set aside one protein sequence for validation and used the remaining proteins for training.

Independent test samples
An independent test sample is a set of data that is independent of the data used in training the model. In addition to LOOCV, independent test samples with known NR subfamilies were used to evaluate the classification model. Independent Datasets 1 and 2 were used for this purpose.

Evaluation Metrics
As NR classification is a multi-class problem, it is transformed into a binary classification problem by adopting a one-versus-the-rest strategy. By doing so, RF-NR assigns, for each subfamily, either positive or negative to the test sequence, giving rise to four frequencies: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four frequencies were used to calculate the RF-NR evaluation metrics, which include accuracy, precision, sensitivity, specificity, F1 score, and Matthews correlation coefficient (MCC):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2TP / (2TP + FP + FN)
MCC = (TP*TN - FP*FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
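The per-subfamily metrics above follow directly from the one-vs-rest counts. The following small Python sketch (the function name, variable names, and toy labels are ours, not from the paper) evaluates one subfamily treated as the positive class.

```python
from math import sqrt

def one_vs_rest_metrics(y_true, y_pred, positive_class):
    """Accuracy, precision, sensitivity, specificity, F1 and MCC
    for one subfamily treated as the positive class (one-vs-rest)."""
    tp = sum(t == positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    fp = sum(t != positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    tn = sum(t != positive_class and p != positive_class for t, p in zip(y_true, y_pred))
    fn = sum(t == positive_class and p != positive_class for t, p in zip(y_true, y_pred))

    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Prec": prec, "Sens": sens, "Spec": spec, "F1": f1, "MCC": mcc}

# Toy usage with hypothetical predictions
print(one_vs_rest_metrics(["NR1", "NR3", "NNR", "NR1"], ["NR1", "NR1", "NNR", "NR1"], "NR1"))
```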

For a fair comparison between our model and the NRPred-FS model, which is one of the most recent studies, the same test samples were used to calculate the evaluation metrics for both models. For NRPred-FS, the calculation was performed using the NRPred-FS webserver [8]. The prediction results of both classification models were analyzed by counting TP, FP, TN, and FN, and these four frequencies were used to calculate the evaluation metrics.

III. RESULTS AND DISCUSSION

3.1 RF parameters optimization

The maximum number of features
A number of values were tested as the maximum number of features. These values include 10% of n, 50% of n, ..., and n, where n is the total number of features. The value of the maximum number of features corresponding to the best accuracy was selected.

The number of trees in the forest
Another important parameter in Random Forest is the number of trees in the forest. In order to find the best number of trees, we plotted the accuracy against the number of trees; the results are shown in Fig. 1. It can be noticed from the figure that the number of trees that achieves the best accuracy is around 200. Moreover, increasing the number of trees beyond this point does not lead to any improvement in the performance. Therefore, the number of trees in the forest was chosen to be 200.

Fig. 1. The number of trees vs. the accuracy. The accuracy was calculated using the independent dataset.

3.2 Feature importance and feature selection

As discussed in the methods section, the Gini feature importance is utilized to estimate the feature importance. Based on this, feature selection is integrated into the RF algorithm. The features are selected based on their prediction efficiency: the best features that split the data into their corresponding classes with minimal impurity indices are selected as predictors and the others are ignored. Therefore, each feature has a weight that indicates its level of importance. Fig. 2 shows the distribution of the importance of all features. We can notice that a large number of the important features are found in the Tripeptide Composition region and a few are also found in the Dipeptide Composition region. We can also notice that, out of the 8420 features, only a few are important, and only these few features will be selected for node splitting while the others will be ignored.

Fig. 2. The distribution of the importance of the features. The y-axis represents the values of feature importance as described in the text. The x-axis corresponds to the indices of AAC, DAAC and TAAC, respectively: the first 20 features correspond to the AAC, the next 400 to DAAC, and the last 8000 features to TAAC. The importance of these features was calculated using the benchmark dataset.

Fig. 3. The top ten important features (ILR, DT, DLW, HKV, PVS, YFT, MFK, CFC, HHQ, LGN). These features were obtained by sorting the 8420 features based on their importance.

In order to gain some insight about important features, the top ten important features, sorted from the highest to the lowest, are shown in Fig. 3. Upon further analysis, we observed that most of these features are located in the less conserved regions of the NRs. This makes sense because the conserved regions like the ligand binding domain (LBD) and DNA binding domain (DBD) are less variable and almost similar in terms of structure and function, while the non-conserved regions are variable and can distinguish one subfamily from another.
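A Gini-importance ranking of this kind can be extracted as in the following sketch, assuming a fitted scikit-learn forest clf as in the earlier sketch and a feature-name list built from the AAC/DAAC/TAAC ordering; the names and variables are our own, not the authors' code.

```python
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Feature names in the same order as the 8420-dimensional vector: AAC, then DAAC, then TAAC
feature_names = (
    list(AMINO_ACIDS)
    + ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    + ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
)

# clf is the fitted RandomForestClassifier from the previous sketch
importances = clf.feature_importances_          # mean Gini importance over all trees
top10 = np.argsort(importances)[::-1][:10]
for idx in top10:
    print(feature_names[idx], round(importances[idx], 5))
```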
Furthermore, a portion of a decision tree from the forest is shown in Fig. 4. The tree shows how a feature is selected for a node split based on the Gini impurity index, and that the Gini index is zero in the terminal nodes, which indicates that the subsets of data at the terminal nodes are pure and correspond to identified subfamilies. Each tree creates a set of logic rules for the path of any test data.

Fig. 4. Part of a decision tree from the forest. The figure demonstrates how the node split takes place from the tree root (top) down to the terminal nodes (leaves). X[i] in each node indicates the feature that has been selected for the split. The Gini index indicates whether the dataset in that node is pure: a non-zero Gini index means the dataset at that node level is impure and will be further split into other subsets, while a zero Gini index signals that the node is terminal (a leaf) and is assigned a terminal class. The samples value in each box indicates the number of samples; when a node is impure, the samples are split among the child nodes. The last line in the terminal nodes shows which class is assigned to the pure samples; for example, a class-count vector with three samples in the position of class 5 means that those three samples were assigned to class 5. In the tree, we can also notice some of the important features; for example, X[96] corresponds to the dipeptide DT that is shown in Fig. 3.

3.3 Performance of the individual feature types

In order to find the best combination of feature types, we compared the performance of the individual feature types, i.e. AAC, DAAC and TAAC, with the performance of all of them combined (AAC+DAAC+TAAC, herein RF-NR) in our framework. The performance of each of these feature sets is shown in Table 3, and the MCC results for each of them are shown in Fig. 5. It can be clearly observed that combining the three feature types results in better prediction accuracy. Therefore, we used the combination of the three feature types in our method and represent each protein sequence using 8420 features: 20 for AAC, 400 for DAAC, and 8000 for TAAC. For the subsequent analysis we use these 8420 features to represent a protein sequence.

TABLE 3
PERFORMANCE OF AAC, DAAC, TAAC AND ALL
(Accuracy and MCC of AAC, DAAC, TAAC and All for NR1-NR6, NR0A, NR0B, NNR and Overall. All represents the combination of AAC, DAAC, and TAAC. The values were obtained based on the performance of the RF-based model on the independent dataset.)

Fig. 5. The performance in terms of MCC for the individual feature types (AAC, DAAC, TAAC) and their combination (All) across the NR subfamilies and non-NR.

3.4 Performance of the RF-NR

To verify the effectiveness of RF-NR, we performed LOOCV on Benchmark Dataset 1. It has to be noted that protein sequences belonging to a particular subfamily were labeled as positive for that subfamily while the remaining ones were labeled as negative. The results of cross-validation are shown in Table 4. It is evident from Table 4 that the prediction accuracy of RF-NR is high for each subfamily as well as overall. Similarly, the MCC of each subfamily is at least 0.93. It is also interesting to note that our prediction accuracy for non-NR is 99.25%.

Similarly, the performance of RF-NR was further evaluated on an independent dataset (Independent Dataset 1) that includes 2749 NR and 1064 non-NR sequences. The results on the independent test dataset are shown in Table 5. It can be observed that for each subfamily we were able to achieve a prediction accuracy of at least 98.22%, and our overall prediction accuracy is 99.60%. Similarly, we were able to achieve a per-subfamily MCC of at least 0.96, and our MCC for non-NR is 0.96.
TABLE 4
PERFORMANCE OF RF-NR USING LOOCV (LEAVE-ONE-OUT CROSS VALIDATION) ON BENCHMARK DATASET 1
(Accuracy, MCC, Sensitivity and Specificity for NR1-NR6, NR0A, NR0B, NNR and Overall.)

TABLE 5
PERFORMANCE OF RF-NR ON INDEPENDENT DATASET 1
(Accuracy, Precision, Sensitivity, Specificity, F1 and MCC for NR1-NR6, NR0A, NR0B, NNR and Overall.)

3.5 Comparing the RF-NR with Other Methods

In order to assess the comparative performance of RF-NR, we compared RF-NR with various other existing methods. The main reason for having two sets of benchmark datasets is to be able to compare our method with most of these existing methods: NRPred-FS [8], NR-2L [6], iNR-PhysChem [7], and NRfamPred [9]. Since Benchmark Dataset 1 is the dataset on which NRPred-FS was based, we compared our method with NRPred-FS using this dataset. Similarly, Benchmark Dataset 2 (used to build iNR-PhysChem, NR-2L and NRfamPred) was used to compare our method with these methods.

Comparing the RF-NR with NRPred-FS
The results of the independent test samples for our method (RF-NR) and NRPred-FS using Independent Dataset 1 are summarized in Table 6 in the form of a confusion matrix. For each method, the column labeled "correct" shows the number of sequences that were correctly identified, while the column labeled "incorrect/subfamily" shows the number of sequences that were incorrectly predicted together with the subfamily they were incorrectly assigned to. The column ACC denotes the accuracy of each method in percent. It can be seen that for all the NR subfamilies and non-NR proteins, the accuracy of RF-NR is higher than that of the NRPred-FS method.

TABLE 6
COMPARATIVE RESULTS USING INDEPENDENT DATASET 1 FOR RF-NR AND NRPRED-FS
(For each subfamily: the total number of sequences, and, for RF-NR and NRPred-FS, the number of correctly predicted sequences, the incorrectly predicted sequences with the subfamily they were assigned to, and the accuracy (ACC) in percent.)

Furthermore, it can be observed from the table that, for subfamily NR1, the NRPred-FS method correctly predicted only 1034 proteins out of 1090, whereas it incorrectly predicted 2 proteins to be NR2 and 54 proteins as non-NR. In contrast, RF-NR correctly predicted 1053 proteins as NR1 and only 37 proteins were predicted to be non-NR. It is also interesting to note that RF-NR has no trouble distinguishing the subfamilies (except that it sometimes incorrectly assigns them to non-NR), whereas NRPred-FS has trouble distinguishing some subfamilies. As an example, out of 671 NR3 sequences (see Table 6), NRPred-FS could classify only 542 sequences correctly, whereas it incorrectly classified 94 proteins as NR1, 17 proteins as NR2, and 18 as non-NR. The confusion matrix shows clearly that the NRPred-FS model has difficulty in predicting NR3, NR4, and NR5 and in discriminating NRs from non-NRs. Fig. 6 shows the MCC for the various NR subfamilies and non-NR for both RF-NR and NRPred-FS based on Independent Dataset 1. It is clear from Fig. 6 that RF-NR performs better than NRPred-FS. The relatively lower MCC scores of NRPred-FS for NR4 and NR5 are attributed to the fact that this method incorrectly assigned a large number of NR4 and NR5 sequences to other subfamilies. The average MCC for RF-NR and NRPred-FS is 0.98 and 0.87, respectively. Other metrics for the comparison of RF-NR and NRPred-FS are shown in Fig. 7, where MCC is shown as a percentage in order to be on the same scale as the other metrics. It can be observed that, across the various metrics presented, RF-NR shows better performance than NRPred-FS.
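A confusion-matrix summary of the kind reported in Table 6 can be produced from per-sequence predictions as in the following generic Python sketch; the class labels and variables are illustrative, not the exact procedure used to build the published table.

```python
from collections import Counter

def confusion_summary(y_true, y_pred, classes):
    """For each true class, count correct predictions and tally the classes
    that sequences were incorrectly assigned to (as in Table 6)."""
    for c in classes:
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == c]
        correct = sum(p == c for _, p in pairs)
        wrong = Counter(p for _, p in pairs if p != c)
        acc = 100.0 * correct / len(pairs) if pairs else float("nan")
        print(f"{c}: total={len(pairs)} correct={correct} incorrect={dict(wrong)} ACC={acc:.2f}%")

# Toy usage with hypothetical labels
classes = ["NR1", "NR2", "NR3", "NR4", "NR5", "NR6", "NR0A", "NR0B", "NNR"]
confusion_summary(["NR1", "NR1", "NR3", "NNR"], ["NR1", "NNR", "NR3", "NNR"], classes)
```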

Fig. 6. MCC of RF-NR and NRPred-FS on Independent Dataset 1 for the NR subfamilies and non-NR.

Fig. 7. The overall metrics (Accuracy, Precision, Sensitivity, Specificity, F1 score and MCC) of both RF-NR and NRPred-FS on Independent Dataset 1 (MCC in % scale).

As an example of the improvement in performance, we provide, as supplementary documents, the sequences that were classified correctly by RF-NR but misclassified by NRPred-FS (see the supplementary sequence files NR0B_supp and NR1_supp to NR5_supp). The numbers of these sequences are 1, 54, 5, 125, 48, and 80 for the subfamilies NR0B, NR1, NR2, NR3, NR4, and NR5, respectively. As shown by Table 6, we can observe a significant improvement in NR subfamily prediction with RF-NR. This improvement can be explained by the fact that RF relies on a majority decision taken according to strict decision rules based on the information extracted from the sequences, rather than relying on a hyperplane, as in SVM, that separates the data points into categories. Such a strategy renders RF less prone to confusion, as the majority decision assigns the most probable subfamily, whereas the separating hyperplane requires clearly separable data points. Moreover, the spectrum-like features are known for their strength in detecting the homology that characterizes the NR subfamilies. Generally, we can say that both the feature extraction method and the machine learning technique contributed to the improvement in the performance of RF-NR. While the spectrum-like features outline the pattern landscape of the information flow along an individual training sequence and across the entire set of training sequences, exposing the boundaries between closely homologous sets of sequences, the RF selects the most important features that create the most prominent boundaries. As mentioned above and shown in Fig. 2 and Fig. 3, the most important features are found in the most variable regions of the NR sequences rather than in the conserved zones such as the DBD and LBD. To contrast RF with SVM and to support the above hypothesis, the same dataset generated with spectrum-like features and used to train and test RF-NR was also used to train and test an SVM model. The evaluation metrics generated with the independent test sequences are provided as a supplementary document (see supplementary Table A).

Comparing the RF-NR with NRfamPred, iNR-PhysChem and NR-2L
In addition, we also compared the performance of RF-NR to NRfamPred [9], iNR-PhysChem [7], and NR-2L [6] using Benchmark Dataset 2, because Benchmark Dataset 2 and Independent Dataset 2 were used for the development of these methods. The corresponding values of sensitivity and MCC for these methods were obtained from the NRfamPred paper [9]. The comparison of the LOOCV performance of RF-NR, NRfamPred, iNR-PhysChem and NR-2L is shown in Table 7.

TABLE 7
PERFORMANCE COMPARISON OF RF-NR, NRFAMPRED, INR-PHYSCHEM AND NR-2L ON BENCHMARK DATASET 2
(Sensitivity and MCC of each method for the NR subfamilies and Overall.)

It can be observed from Table 7 that RF-NR achieved 95.54% sensitivity and an MCC of 0.94, which is the highest among all the compared methods. It is to be noted here that RF-NR has the highest sensitivity for all subfamilies except subfamily NR1; the same is true for MCC. Furthermore, in order to compare the performance of these methods on a blind dataset, Independent Dataset 2 was utilized.
The results for RF-NR, NRfamPred and NR-2L are shown in Table 8. It can be observed that RF-NR performs as well as these other state-of-the-art methods in a real-life prediction scenario. The added advantage of our method compared to these methods is that we can also gain some insights about the importance of the features.

TABLE 8
PERFORMANCE COMPARISON OF RF-NR, NRFAMPRED AND NR-2L ON INDEPENDENT DATASET 2
(Sensitivity and MCC of each method for the NR subfamilies and Overall.)

IV. CONCLUSION AND DISCUSSION

In this study, we developed a Random Forest based method (RF-NR) to identify NR proteins and subsequently classify them into their respective NR subfamilies. The NR classification problem is posed as a multi-class classification problem and solved using a one-vs-rest strategy. The number of random trees in the forest was set to 200 based on the observed prediction accuracy, and a weighted-class strategy is used to offset the imbalanced class problem. RF-NR is able to predict with near-optimal accuracy whether a query protein sequence belongs to one of the eight subfamilies or to the non-NR group. The new NR classification method uses a combination of discrete amino acid compositions of different settings, which is a parsimonious technique, to extract the pattern information conserved due to the homology of the nuclear receptors. Essentially, the method uses spectrum-like features (k-mer features), namely AAC, DAAC and TAAC. The number of features is considerably high compared to the number of sequences, which may raise some concerns for those who are used to other machine learning algorithms such as support vector machines. On the contrary, RF can handle high dimensionality in an elegant way by adopting a feature selection technique that ensures that only important features with strong prediction capability are selected (see Fig. 2, Fig. 3, and Fig. 4). Another concern is overfitting, in which the model fits the training dataset well but performs poorly on datasets other than the ones used in the study. Most overfitting problems arise because the dataset used for testing was also used for training. The training datasets used in this study were filtered to remove closely similar and redundant sequences, as explained in the dataset section, and the test sequences used for evaluation are sequences that were not included in the training dataset. Furthermore, two sets of datasets with varying redundancy reduction cut-offs (40% and 60%) were utilized to ensure that the high prediction performance is not due to high sequence similarity within the dataset. Moreover, to avoid bias from restricting the evaluation to a subset of sequences, as previous studies did, we exhausted all the remaining sequences in the NucleaRDB database to be sure about the generalization of our model. Furthermore, RF is less prone to overfitting, as demonstrated empirically by Breiman [11]. We also examined the possibility of creating a third dataset with a 30% redundancy reduction cut-off, but we were unable to train the model because of the very small number of samples left in each subfamily of NR proteins. The prediction of the subfamily is conducted by probabilistic averaging of all classifiers, which makes the prediction decision robust and less affected by noise and outliers. Moreover, RF-NR uses the Gini index as a criterion to select the features for node splitting, and it uses the out-of-bag (OOB) error as a built-in error estimate. The method was systematically validated with cross-validation and independent test samples using two sets of datasets with varying sequence redundancy reduction criteria. The performance on the independent datasets and the comparative study between RF-NR and NR-2L, iNR-PhysChem, NRPred-FS and NRfamPred also showed that RF-NR performs equally well as or better than some of the best predictors.
In conclusion, we were able to develop an NR classification method that is accurate, precise, and specific compared to state-of-the-art existing methods. A web site will be developed soon to serve the scientific community.

V. ACKNOWLEDGMENTS

The authors would like to thank the developers of NucleaRDB and Dr. Xiao for providing the dataset and the web server for NRPred-FS. DBKC is partly supported by a startup grant from the Department of Computational Science and Engineering at North Carolina A&T State University. DBKC is also partly supported by the National Science Foundation under Cooperative Agreement No. DBI.

VI. REFERENCES

[1] H. Gronemeyer, J.A. Gustafsson, and V. Laudet, "Principles for modulation of the nuclear receptor superfamily," Nat Rev Drug Discov, 3, 2004.
[2] A.L. Hopkins and C.R. Groom, "The druggable genome," Nat Rev Drug Discov, 1, 2002.
[3] Nuclear Receptors Nomenclature Committee, "A unified nomenclature system for the nuclear receptor superfamily," Cell, 97(2), 1999.
[4] Q.B. Gao, Z.C. Jin, X.F. Ye, C. Wu, and J. He, "Prediction of nuclear receptors with optimal pseudo amino acid composition," Anal Biochem, 387: 54-59.
[5] B. Vroling, D. Thorne, P. McDermott, H.J. Joosten, T.K. Attwood, et al., "NucleaRDB: Information System for Nuclear Receptors," Nucleic Acids Res, 2012.

[6] P. Wang, X. Xiao, and K.C. Chou, "NR-2L: A Two-Level Predictor for Identifying Nuclear Receptor Subfamilies Based on Sequence-Derived Features," PLoS ONE, 6(8), 2011.
[7] X. Xiao, P. Wang, and K.C. Chou, "iNR-PhysChem: A Sequence-Based Predictor for Identifying Nuclear Receptors and Their Subfamilies via Physical-Chemical Property Matrix," PLoS ONE, 7: e30869, 2012.
[8] P. Wang and X. Xiao, "NRPred-FS: A Feature Selection Based Two-Level Predictor for Nuclear Receptors," J Proteomics Bioinform, S9: 002.
[9] R. Kumar, B. Kumari, A. Srivastava, and M. Kumar, "NRfamPred: a proteome-scale two level method for prediction of nuclear receptor proteins and their subfamilies," Sci Rep, 4: 6810, 2014.
[10] C. Leslie, E. Eskin, and W.S. Noble, "The Spectrum Kernel: A String Kernel for SVM Protein Classification," Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 2002.
[11] L. Breiman, "Random Forests," Machine Learning, 45(1): 5-32, 2001.
[12] V. Laudet, "Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor," J. Mol. Endocrinol., 19(3), 1997.
[13] H. Ying, B. Niu, Y. Gao, L. Fu, and W. Li, "CD-HIT Suite: a web server for clustering and comparing biological sequences," Bioinformatics, 26, 2010.
[14] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler, "GenBank," Nucleic Acids Research, 33: D34-D38, 2005.
[15] K.C. Chou, "Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology," Curr Proteomics, 6.
[16] K. Nishikawa, Y. Kubota, and T. Ooi, "Classification of Proteins into Groups Based on Amino Acid Composition and Other Characters," J Biochem, 94.
[17] D.S. Cao, Q.S. Xu, and Y.Z. Liang, "propy: a tool to generate various modes of Chou's PseAAC," Bioinformatics, 29, 2013.
[18] T. Jo and J. Cheng, "Improving Fold Recognition by Random Forest," BMC Bioinformatics, 15(Suppl 11): S14, 2014.
[19] F. Pedregosa, G. Varoquaux, ..., E. Duchesnay, "Scikit-learn: Machine Learning in Python," JMLR, 12: 2825-2830, 2011.

Hamid D. Ismail received a DVM and a Higher Diploma in Computer Programming and Statistics from the University of Khartoum, and an M.Sc. in computational science and engineering from NC A&T State University. He is a SAS Certified Advanced Programmer and a Certified Oracle SQL Expert. He has worked as a statistician and software developer with several companies. He is currently working as a research associate at NC A&T State University and pursuing a doctoral degree in computational science and engineering at NC A&T State University.

Hiroto Saigo received the BS degree in electrical and electronics engineering from Sophia University. He received his Master's and Doctor's degrees, both in informatics, from Kyoto University in 2003 and 2006, respectively. He was also a visiting student at the University of California, Irvine (UCI). He worked as a research scientist at the Max Planck Institute (MPI) for Biological Cybernetics and at the MPI for Informatics from 2008 to 2010. Since 2010, he has been an associate professor at Kyushu Institute of Technology. His research focuses on the development of statistical machine learning methods tailored for problems in bioinformatics and cheminformatics. He serves as a reviewer/PC member for numerous top-level conferences and journals.

Dukka B KC received a B.E. in Computer Science, an M.Inf. in Bioinformatics, and a PhD in Bioinformatics from Kyoto University in 2001, 2003 and 2006, respectively. From 2006 to 2007, he was a postdoctoral fellow at Georgia Institute of Technology.
He was subsequently a postdoctoral fellow at the University of North Carolina at Charlotte, and then a CRTA fellow at the National Cancer Institute and a Bioinformatics Scientist at the Center for Information Technology at the National Institutes of Health. Currently, he is an assistant professor at North Carolina A&T State University. His research interests are in developing algorithms for deciphering the sequence/structure/function evolution relationships of proteins.


Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

A model for the evaluation of domain based classification of GPCR

A model for the evaluation of domain based classification of GPCR 4(4): 138-142 (2009) 138 A model for the evaluation of domain based classification of GPCR Tannu Kumari *, Bhaskar Pant, Kamalraj Raj Pardasani Department of Mathematics, MANIT, Bhopal - 462051, India;

More information

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee Support Vector Machine Industrial AI Lab. Prof. Seungchul Lee Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories /

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information

Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins

Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins Mol Divers (2008) 12:41 45 DOI 10.1007/s11030-008-9073-0 FULL LENGTH PAPER Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins Bing Niu Yu-Huan Jin Kai-Yan

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Random Forests: Finding Quasars

Random Forests: Finding Quasars This is page i Printer: Opaque this Random Forests: Finding Quasars Leo Breiman Michael Last John Rice Department of Statistics University of California, Berkeley 0.1 Introduction The automatic classification

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Supplementary Information

Supplementary Information Supplementary Information Performance measures A binary classifier, such as SVM, assigns to predicted binding sequences the positive class label (+1) and to sequences predicted as non-binding the negative

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last

More information

Machine Learning on temporal data

Machine Learning on temporal data Machine Learning on temporal data Classification rees for ime Series Ahlame Douzal (Ahlame.Douzal@imag.fr) AMA, LIG, Université Joseph Fourier Master 2R - MOSIG (2011) Plan ime Series classification approaches

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Supplementary Materials for R3P-Loc Web-server

Supplementary Materials for R3P-Loc Web-server Supplementary Materials for R3P-Loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to R3P-Loc Server Contents 1 Introduction to R3P-Loc

More information

Bayesian Decision Theory

Bayesian Decision Theory Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Bayesian classifier is

More information

Predicting flight on-time performance

Predicting flight on-time performance 1 Predicting flight on-time performance Arjun Mathur, Aaron Nagao, Kenny Ng I. INTRODUCTION Time is money, and delayed flights are a frequent cause of frustration for both travellers and airline companies.

More information

CS 188: Artificial Intelligence. Outline

CS 188: Artificial Intelligence. Outline CS 188: Artificial Intelligence Lecture 21: Perceptrons Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein. Outline Generative vs. Discriminative Binary Linear Classifiers Perceptron Multi-class

More information

Random Forests for Ordinal Response Data: Prediction and Variable Selection

Random Forests for Ordinal Response Data: Prediction and Variable Selection Silke Janitza, Gerhard Tutz, Anne-Laure Boulesteix Random Forests for Ordinal Response Data: Prediction and Variable Selection Technical Report Number 174, 2014 Department of Statistics University of Munich

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Decision Support. Dr. Johan Hagelbäck.

Decision Support. Dr. Johan Hagelbäck. Decision Support Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Decision Support One of the earliest AI problems was decision support The first solution to this problem was expert systems

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Meta-Learning for Escherichia Coli Bacteria Patterns Classification

Meta-Learning for Escherichia Coli Bacteria Patterns Classification Meta-Learning for Escherichia Coli Bacteria Patterns Classification Hafida Bouziane, Belhadri Messabih, and Abdallah Chouarfia MB University, BP 1505 El M Naouer 3100 Oran Algeria e-mail: (h_bouziane,messabih,chouarfia)@univ-usto.dz

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18 Decision Tree Analysis for Classification Problems Entscheidungsunterstützungssysteme SS 18 Supervised segmentation An intuitive way of thinking about extracting patterns from data in a supervised manner

More information

Machine Learning in Action

Machine Learning in Action Machine Learning in Action Tatyana Goldberg (goldberg@rostlab.org) August 16, 2016 @ Machine Learning in Biology Beijing Genomics Institute in Shenzhen, China June 2014 GenBank 1 173,353,076 DNA sequences

More information

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Inferring Transcriptional Regulatory Networks from Gene Expression Data II Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Hypothesis Evaluation

Hypothesis Evaluation Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall 1395 1 / 31 Table of contents 1 Introduction

More information

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery AtomNet A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery Izhar Wallach, Michael Dzamba, Abraham Heifets Victor Storchan, Institute for Computational and

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 1 2 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 2 An experimental bias variance analysis of SVM ensembles based on resampling

More information