RF-NR: Random forest based approach for improved classification of Nuclear Receptors


Hamid D. Ismail, Hiroto Saigo, Dukka B KC*

Abstract: The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and function of newly discovered NR proteins. Recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses spectrum-like features, namely Amino Acid Composition, Dipeptide Composition and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria shows that RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the features that are most important for classifying NR subfamilies.

Index Terms - nuclear receptor, protein classification, Random Forest, spectrum-kernel

H.I. is with the Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC. E-mail: hismail@ncat.edu. H.S. is with the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka-shi, Fukuoka, Japan. E-mail: saigo@bio.kyutech.ac.jp. DBKC is with the Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC. E-mail: dbkc@ncat.edu. *: Corresponding Author

I. INTRODUCTION

The Nuclear Receptor (NR) superfamily includes a number of globular proteins that play key roles as transcription factors by regulating the expression of genes involved in glucose and lipid metabolism, immune response, and cell division and differentiation [1]. All NR proteins have a common modular domain organization. A typical nuclear receptor consists of an N-terminal A/B domain, a conserved DNA binding domain (DBD) or region C, a linker region D, and a conserved E region that contains the ligand-binding domain (LBD). The cellular malfunction of NR proteins is implicated in many disease conditions such as high blood pressure, high blood cholesterol, type II diabetes, immune deficiency, and cancer. Therefore, NR proteins have recently become important drug targets [2]. Based on phylogeny, the NR superfamily has been sub-divided into eight subfamilies. Due to the large number of new protein sequences being generated in this era, the identification of NRs and their subfamilies from amino acid sequence information is an important problem in the field of bioinformatics. In this regard, there have been various attempts to develop computational methods to classify NRs. Bhasin and Raghava (2004) [3] developed an SVM-based classification model using amino acid composition and dipeptide composition. Gao et al. [4] reconstructed the dataset, introduced the pseudo amino acid composition (PseAAC) and achieved an overall high accuracy, but the number of NR subfamilies included in the classification was only four rather than eight [5]. Recently, two predictors, NR-2L [6] and iNR-PhysChem [7], were proposed to perform NR classification in two levels.
In the first level, the model determines whether the protein is an NR or not, and in the second level it predicts to which subfamily the protein belongs. These two predictors obtained high prediction accuracy, but they still have some shortcomings: the dataset for the model was derived from an old version of NucleaRDB and no feature selection was applied. Most recently, NRPred-FS [8] has been proposed; it applies a feature selection algorithm to reduce the feature dimensionality and has overall prediction accuracies of 97% and 93% for the first and second levels of NR prediction, respectively. Furthermore, NRfamPred [9] has been proposed, which uses SVM for two-level prediction of NRs and their subfamilies. Despite the progress in prediction accuracy and the inclusion of the eight known NR subfamilies, the performance of these methods is still unsatisfactory. Specifically, the evaluation of NRPred-FS was based on a small independent test sample. Moreover, NRPred-FS [8], which has been shown to be the best method available, failed to achieve reasonably high accuracy for some of the NR subfamilies. Thus, there is still a need for the development of a new classification method that has better prediction accuracy for all the NR subfamilies. On one hand, the spectrum kernel [10] was recently proposed for protein sequence classification.

It basically uses the k-spectrum (for a given number k >= 1) of an input sequence, which is the set of all the k-length (contiguous) subsequences it contains. The spectrum kernel is conceptually simple and efficient to compute, and it performs well in comparison with other homology detection methods. Hence, we utilized spectrum-like features by decomposing the amino acid sequence into Amino Acid Composition (AAC), Dipeptide Composition (DAAC) and Tripeptide Composition (TAAC). On the other hand, Random Forest (RF) [11], a powerful machine learning tool that can handle a large number of features, is gaining a lot of popularity. In this regard, we used the combined spectrum-like features (AAC, DAAC, and TAAC) in an RF framework to accurately classify NRs. We call this NR classification method RF-NR (Random Forest based Nuclear Receptor classifier). On an independent dataset, RF-NR achieved an overall accuracy close to 99.6% with a substantial improvement in the prediction of all subfamilies, reflecting that RF-NR performs better than (or comparably to) all other previous methods. In summary, when a new protein sequence is given as an input, RF-NR can either identify it as a non-NR or classify it into one of the NR subfamilies.

II. MATERIAL AND METHODS

2.1 Datasets

In our present work, we used two sets of data, each comprising a benchmark dataset for cross-validation and an independent dataset. The first one (Benchmark Dataset 1) is the dataset published earlier for the development of NRPred-FS [8], and the second one (Benchmark Dataset 2) is the published dataset for the development of NR-2L [6] and iNR-PhysChem [7], also used for NRfamPred [9]. The main difference between the two datasets is the sequence redundancy criterion used: 40% in Benchmark Dataset 1 and 60% in Benchmark Dataset 2.

Benchmark Dataset 1
Benchmark Dataset 1 has 267 NRs and 1000 non-NRs. The sequences of the known NR subfamilies were obtained from the Nuclear Receptor Database (NucleaRDB) [5]. The dataset is shown in Table 1. The NR classification [12] in NucleaRDB is based on the phylogeny-based classification that organizes the NR proteins into eight subfamilies (see Table 1). Each subfamily includes a number of NR proteins whose member sequences were obtained from different animal species. The initial dataset has 3016 NR sequences. CD-HIT [13], a widely used clustering program, was used to remove redundant sequences that have a similarity of more than 40% for NR1-NR6 and more than 80% for NR0A and NR0B, as in NRPred-FS [8]. A similarity cut-off of 80% was used for NR0A and NR0B because these groups had too few sequences. The final benchmark dataset contained 267 NR proteins belonging to eight subfamilies, as shown in Table 1. The 1000 non-NR sequences were obtained from the authors of NRPred-FS.

TABLE 1
BENCHMARK DATASET 1
(Columns: #, Subfamily, # of sequences in NucleaRDB, # of sequences after CD-HIT.)
1 Thyroid hormone like (NR1)
2 HNF4-like (NR2)
3 Estrogen like (NR3)
4 Nerve GF IB-like (NR4)
5 Fushi tarazu-F1 like (NR5)
6 Germ cell NF like (NR6)
7 Knirps like (NR0A)
8 Dax like (NR0B)
9 Non-NR protein (NNR): 1000

Independent Dataset 1
In order to evaluate the real-life performance of the predictor, there is a need for an independent benchmark dataset. In this regard, we also compiled Independent Dataset 1. This dataset contains all NR sequences except the ones that are in Table 1; it contains 2749 NR sequences belonging to eight subfamilies. This dataset is shown in Columns 1 and 2 of Table 6. The 1064 non-NR protein sequences, which are not in Dataset 1, were randomly collected from the GenBank protein database [14]. These sequences were also subjected to the 40% redundancy reduction procedure using CD-HIT so that no protein had more than 40% sequence identity to any other sequence in the dataset.

TABLE 2
INDEPENDENT DATASET 1
(Columns: #, Subfamily, # of sequences.)
1 Thyroid hormone like (NR1)
2 HNF4-like (NR2)
3 Estrogen like (NR3)
4 Nerve GF IB-like (NR4)
5 Fushi tarazu-F1 like (NR5)
6 Germ cell NF like (NR6): 30
7 Knirps like (NR0A): 18
8 Dax like (NR0B): 27
9 Non-NR proteins (NNR): 1064

Benchmark Dataset 2
In order to compare RF-NR with other existing methods, we also used the earlier published dataset that was used for the development of the NR-2L [6] and iNR-PhysChem [7] methods. This dataset was also generated from NucleaRDB. The dataset has 159 NRs and 500 non-NRs.

The redundancy criterion used in this dataset is 60%.

Independent Dataset 2
We also used an independent dataset that was published earlier in the development of NR-2L [6]. This dataset has 568 NRs and 500 non-NRs.

2.2 Protein Sequence Features

The goal of this step is to transform the sequence data into vectors of numerical values that can be used to learn the underlying model. This is often the most critical step that determines whether the method will ultimately be successful. Various protein descriptors have been proposed for feature extraction [15,16]. Furthermore, the Spectrum Kernel [10] is one of the descriptors suggested for the detection of remote protein homology. It uses the k-spectrum (for a given number k >= 1) of an input sequence, which is the set of all the k-length (contiguous) subsequences it contains. The spectrum kernel is conceptually simple and efficient to compute, and it performs well in comparison with other protein descriptors that are used for homology detection. In this study, we use spectrum-like features that combine discrete protein features including Amino Acid Composition (AAC), Dipeptide Amino Acid Composition (DAAC), and Tripeptide Amino Acid Composition (TAAC). The discrete features (AAC and DAAC) are, in fact, relative frequencies of amino acids at certain settings, a notion associated with probabilities whose interpretation is straightforward in terms of the sequence pattern landscape. We chose this combination of protein descriptors for feature extraction with the hypothesis that it works best for NR classification, which basically relies on the homology detection that contributes significantly to the subfamily prediction of NR sequences. With spectrum-like features, each protein sequence is represented by the following feature vector:

x = (x_AAC, x_DAAC, x_TAAC)   (1)

where x_AAC is the AAC, x_DAAC is the DAAC, and x_TAAC is the TAAC.

Amino Acid Composition (AAC)
Amino acid composition is one of the simplest and most effective features. The AAC of a protein sequence of length N is defined as the count of each of the 20 amino acids in the sequence divided by N:

AAC_i = n_i / N,  i = 1, 2, ..., 20   (2)

where n_i is the number of occurrences of the i-th amino acid.

Dipeptide Amino Acid Composition (DAAC)
Dipeptide amino acid composition is a feature that captures local-order information of a protein sequence. DAAC is defined as

DAAC_i = n(dip_i) / (N - 1),  i = 1, 2, ..., 400   (3)

where dip_i represents any possible dipeptide and n(dip_i) is its count in the sequence. The total number of possible dipeptides is 20^2 = 400.

Tripeptide Composition (TAAC)
Tripeptide amino acid composition (3-mer spectrum) of a sequence is the total count of each possible 3-mer of amino acids in the protein sequence:

TAAC_i = n(trip_i),  i = 1, 2, ..., 8000   (4)

where trip_i represents any possible tripeptide. The total number of 3-mers is 20^3 = 8000. As in the Spectrum Kernel, one could continue adding more discrete features such as the 4-mer spectrum. However, the number of features becomes considerably large as k increases; in the case of 4-mers, the number of features becomes 20^4 = 160,000, which is not favored by machine learning algorithms. Hence, we did not explore 4-mers in our work. As discussed in the results section, the combination of AAC, DAAC and TAAC produces the best results. Hence, in our study we represent each protein sequence using 8420 features: 20 for AAC, 400 for DAAC, and 8000 for TAAC. We use the propy package [17] to obtain these features.
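To illustrate how such spectrum-like features can be computed, the following is a minimal Python sketch of the k-spectrum decomposition. It is not the propy-based pipeline used in the paper; the function names, the toy sequence, and the normalization of DAAC by (N - 1) are our own illustrative assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_vocab(k):
    """All possible k-mers over the 20 standard amino acids, in a fixed order."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]

def kmer_counts(seq, k):
    """Count occurrences of every contiguous k-mer in the sequence (k-spectrum)."""
    counts = dict.fromkeys(kmer_vocab(k), 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing non-standard residues
            counts[kmer] += 1
    return counts

def spectrum_features(seq):
    """AAC (frequencies), DAAC (frequencies) and TAAC (raw counts): 20 + 400 + 8000 = 8420 values."""
    n = len(seq)
    aac = [c / n for c in kmer_counts(seq, 1).values()]
    daac = [c / max(n - 1, 1) for c in kmer_counts(seq, 2).values()]   # normalization assumed
    taac = list(kmer_counts(seq, 3).values())                          # total counts, as in Eq. (4)
    return aac + daac + taac

# Example: feature vector for a short toy sequence
features = spectrum_features("MKVLAAGIVLLGSLLAAG")
print(len(features))  # 8420
```

In the actual study these features were generated with propy [17]; the sketch only shows the underlying k-spectrum idea.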
2.3 Random Forest

Since the number of features is large in this study, the random forest (RF) is the machine learning method of choice, as it is equipped with a mechanism that can deal with the high dimensionality of the data by selecting only the important features. The RF, which was proposed by Breiman [11], is an ensemble of decision trees that are grown using only subsets of the features, and the final classification is obtained by averaging the decisions of all trees in the forest. RF has been applied to many structural bioinformatics problems such as fold recognition [18]. It is known for its lower generalization error (GE), relative robustness to outliers and noise, and its immunity against over-fitting compared to other machine learning methods [18,19]. The randomness of bootstrap sampling enhances the prediction accuracy. Each individual decision tree in the random forest is constructed with a bootstrap sample from the training dataset. It is made up of a root node, internal nodes, and leaves. Each node represents a feature that is selected based on a criterion. A node may have two or more branches, and each branch corresponds to a range of values for that selected feature. The node branching of the decision tree is performed by computing the Gini index for each feature, and only the most important

feature that splits the training data into the purest classes is selected to represent a node, and the ranges of its values that split the sequences are chosen as decision rules. Query sequences are classified by traversing the tree from the root node down to a leaf, where the path is determined according to the outcome of the splitting condition at each node. At each node, we find to which outgoing branch the observed value of the given feature corresponds. Finally, the query sequence is assigned to one of the eight subfamilies or to non-NR. The classification process is illustrated in Fig. 4.

Feature importance and feature selection
In RF, the Gini impurity index for each feature is calculated and considered for the node split. The importance of a feature is estimated as the sum of the Gini index reduction (from parent to child) over all nodes in which that feature is used to split. The feature that contributes the largest Gini index reduction is the most important. Thus, the features can be ranked according to their importance. The overall importance of a feature in the forest is defined as the average of its importance values over all trees in the forest.

RF parameters
For better results, RF requires some parameters to be set by the user. These parameters include the maximum number of features to be considered, the maximum number of nodes in a tree, and the number of trees in the forest. To choose the best values, different values of the parameters were used and the performance was recorded each time; the values that yielded the best performance were then selected.

Class weights
As in most one-versus-rest multi-class problems, RF may suffer when trained on an extremely imbalanced dataset, since it is more probable that a bootstrap sample contains few or even none of the minority class, which results in a tree with poor performance for predicting the minority class. This problem was overcome by assigning a weight to each class, with the minority classes given larger weights. The class weights are used to weight the Gini index in tree induction and are also used in the tree terminal nodes for class prediction. The class weights are adjusted automatically to be inversely proportional to the class frequencies in the input data. The RF is a robust learner and less prone to generalization error and overfitting. The prediction of the sequence subfamily depends on probabilistic averaging over the decision trees rather than voting for a single subfamily: a vector of probabilities corresponding to the subfamilies is produced for each prediction, and a sequence is assigned the most probable subfamily. The RF algorithm used in RF-NR is as follows. Given the training dataset D with size n, the number of trees in the forest (B), the maximum number of random features to select (m), and the maximum number of leaf nodes (N):
1. b <- 1
2. REPEAT
3.   S_b <- sample n instances from D with replacement
4.   Build the decision tree T_b on S_b using at most m random features per split and at most N leaf nodes
5.   b <- b + 1
6. UNTIL b > B
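As a concrete illustration of the training setup described above, the following is a minimal Python sketch using scikit-learn (cited as [19]). Apart from the 200 trees and Gini criterion reported in the paper, the file names, parameter values, and variable names are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: (num_sequences, 8420) spectrum-like feature matrix (AAC + DAAC + TAAC)
# y: class labels, e.g. "NR1".."NR6", "NR0A", "NR0B", "NNR"
X = np.load("features.npy")                   # hypothetical pre-computed feature file
y = np.load("labels.npy", allow_pickle=True)  # hypothetical label file

clf = RandomForestClassifier(
    n_estimators=200,          # number of trees in the forest (B)
    max_features="sqrt",       # random subset of features considered per split (m); an assumption
    criterion="gini",          # node splits chosen by Gini impurity
    class_weight="balanced",   # weights inversely proportional to class frequencies
    oob_score=True,            # out-of-bag estimate as a built-in error check
    random_state=0,
)
clf.fit(X, y)

# Class-probability averaging over the trees; the most probable class is the prediction
probs = clf.predict_proba(X[:1])
print(clf.classes_[np.argmax(probs)], clf.oob_score_)
```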
2.4 Model Validation

The goal of model validation is to assess the models thoroughly for prediction accuracy. In this study, two evaluation strategies were adopted: leave-one-out cross-validation (LOOCV) and independent test samples.

Leave-one-out Cross-validation (LOOCV)
Leave-one-out cross-validation is a model validation technique to assess how the results of a model will generalize to an independent dataset. It is a cross-validation technique in which one observation is left out as the validation set and the remaining observations are used as the training set. In this regard, for our purpose we set aside one protein sequence for validation and used the remaining proteins for training.

Independent test samples
An independent test sample is a set of data that is independent of the data used in training the model. In addition to LOOCV, independent test samples with known NR subfamilies were used to evaluate the classification model. Independent Datasets 1 and 2 were used for this purpose.

Evaluation Metrics
As NR classification is a multi-class problem, it is transformed into a binary classification problem by adopting a one-versus-the-rest strategy. By doing so, RF-NR assigns, for each subfamily, either positive or negative to the test sequence, giving rise to four frequencies: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four frequencies were used to calculate the RF-NR evaluation metrics, which include accuracy, precision, sensitivity, specificity, F1 score, and Matthews correlation coefficient (MCC):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2TP / (2TP + FP + FN)
MCC = (TP*TN - FP*FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
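The per-subfamily metrics above follow directly from the one-vs-rest counts. The following small Python sketch (the function name, variable names, and toy labels are ours, not from the paper) evaluates one subfamily treated as the positive class.

```python
from math import sqrt

def one_vs_rest_metrics(y_true, y_pred, positive_class):
    """Accuracy, precision, sensitivity, specificity, F1 and MCC
    for one subfamily treated as the positive class (one-vs-rest)."""
    tp = sum(t == positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    fp = sum(t != positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    tn = sum(t != positive_class and p != positive_class for t, p in zip(y_true, y_pred))
    fn = sum(t == positive_class and p != positive_class for t, p in zip(y_true, y_pred))

    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Prec": prec, "Sens": sens, "Spec": spec, "F1": f1, "MCC": mcc}

# Toy usage with hypothetical predictions
print(one_vs_rest_metrics(["NR1", "NR3", "NNR", "NR1"], ["NR1", "NR1", "NNR", "NR1"], "NR1"))
```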

For a fair comparison between our model and the NRPred-FS model, which is one of the most recent studies, the same test samples were used to calculate the evaluation metrics for both models. For NRPred-FS, the calculation was performed using the NRPred-FS webserver [8]. The prediction results of both classification models were analyzed by counting TP, FP, TN, and FN, and these four frequencies were used to calculate the evaluation metrics.

III. RESULTS AND DISCUSSION

3.1 RF parameters optimization

The maximum number of features
A number of values were tested as the maximum number of features. These values include 10% of n, 50% of n, ..., and n, where n is the total number of features. The value of the maximum number of features corresponding to the best accuracy was selected.

The number of trees in the forest
Another important parameter in Random Forest is the number of trees in the forest. In order to find the best number of trees, we plotted the accuracy against the number of trees; the results are shown in Fig. 1. It can be noticed from the figure that the number of trees that achieves the best accuracy is around 200. Moreover, increasing the number of trees beyond this point does not lead to any improvement in the performance. Therefore, the number of trees in the forest was chosen to be 200.

Fig. 1. The number of trees vs. the accuracy. The accuracy was calculated using the independent dataset.

3.2 Feature importance and feature selection

As discussed in the methods section, the Gini feature importance is utilized to estimate the feature importance. Based on this, feature selection is integrated into the RF algorithm. The features are selected based on their prediction efficiency: the best features that split the data into their corresponding classes with minimal impurity indices are selected as predictors and the others are ignored. Therefore, each feature has a weight that indicates its level of importance. Fig. 2 shows the distribution of the importance of all features. We can notice that a large number of the important features are found in the Tripeptide Composition region and a few are also found in the Dipeptide Composition region. We can also notice that, out of the 8420 features, only a few are important, and only these few features will be selected for node splitting while the others will be ignored.

Fig. 2. The distribution of the importance of the features. The y-axis represents the values of feature importance as described in the text. The x-axis corresponds to the indices of AAC, DAAC and TAAC, respectively: the first 20 features correspond to the AAC, the next 400 to DAAC, and the last 8000 features to TAAC. The importance of these features was calculated using the benchmark dataset.

Fig. 3. The top ten important features (ILR, DT, DLW, HKV, PVS, YFT, MFK, CFC, HHQ, LGN). These features were obtained by sorting the 8420 features based on their importance.

In order to gain some insight about important features, the top ten important features, sorted from the highest to the lowest, are shown in Fig. 3. Upon further analysis, we observed that most of these features are located in the less conserved regions of the NRs. This makes sense because the conserved regions like the ligand binding domain (LBD) and DNA binding domain (DBD) are less variable and almost similar in terms of structure and function, while the non-conserved regions are variable and can distinguish one subfamily from another.
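A Gini-importance ranking of this kind can be extracted as in the following sketch, assuming a fitted scikit-learn forest clf as in the earlier sketch and a feature-name list built from the AAC/DAAC/TAAC ordering; the names and variables are our own, not the authors' code.

```python
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Feature names in the same order as the 8420-dimensional vector: AAC, then DAAC, then TAAC
feature_names = (
    list(AMINO_ACIDS)
    + ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    + ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
)

# clf is the fitted RandomForestClassifier from the previous sketch
importances = clf.feature_importances_          # mean Gini importance over all trees
top10 = np.argsort(importances)[::-1][:10]
for idx in top10:
    print(feature_names[idx], round(importances[idx], 5))
```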
Furthermore, a portion of a decision tree from the forest is shown in Fig. 4. The tree shows how a feature is selected for a node split based on the Gini impurity index, and that the Gini index is zero in the terminal nodes, which indicates that the subsets of data at the terminal nodes are pure and correspond to identified subfamilies. Each tree creates a set of logic rules for the path of any test data.

Fig. 4. Part of a decision tree from the forest. The figure demonstrates how the node split takes place from the tree root (top) down to the terminal nodes (leaves). X[i] in each node indicates the feature that has been selected for the split. The Gini index indicates whether the dataset in that node is pure: a non-zero Gini index means the dataset at that node level is impure and will be further split into other subsets, while a zero Gini index signals that the node is terminal (a leaf) and is assigned a terminal class. The samples value in each box indicates the number of samples; when a node is impure, the samples are split among the child nodes. The last line in the terminal nodes shows which class is assigned to the pure samples; for example, a class-count vector with three samples in the position of class 5 means that those three samples were assigned to class 5. In the tree, we can also notice some of the important features; for example, X[96] corresponds to the dipeptide DT that is shown in Fig. 3.

3.3 Performance of the individual feature types

In order to find the best combination of feature types, we compared the performance of the individual feature types, i.e. AAC, DAAC and TAAC, with the performance of all of them combined (AAC+DAAC+TAAC, herein RF-NR) in our framework. The performance of each of these feature sets is shown in Table 3, and the MCC results for each of them are shown in Fig. 5. It can be clearly observed that combining the three feature types results in better prediction accuracy. Therefore, we used the combination of the three feature types in our method and represent each protein sequence using 8420 features: 20 for AAC, 400 for DAAC, and 8000 for TAAC. For the subsequent analysis we use these 8420 features to represent a protein sequence.

TABLE 3
PERFORMANCE OF AAC, DAAC, TAAC AND ALL
(Accuracy and MCC of AAC, DAAC, TAAC and All for NR1-NR6, NR0A, NR0B, NNR and Overall. All represents the combination of AAC, DAAC, and TAAC. The values were obtained based on the performance of the RF-based model on the independent dataset.)

Fig. 5. The performance in terms of MCC for the individual feature types (AAC, DAAC, TAAC) and their combination (All) across the NR subfamilies and non-NR.

3.4 Performance of the RF-NR

To verify the effectiveness of RF-NR, we performed LOOCV on Benchmark Dataset 1. It has to be noted that protein sequences belonging to a particular subfamily were labeled as positive for that subfamily while the remaining ones were labeled as negative. The results of cross-validation are shown in Table 4. It is evident from Table 4 that the prediction accuracy of RF-NR is high for each subfamily as well as overall. Similarly, the MCC of each subfamily is at least 0.93. It is also interesting to note that our prediction accuracy for non-NR is 99.25%.

Similarly, the performance of RF-NR was further evaluated on an independent dataset (Independent Dataset 1) that includes 2749 NR and 1064 non-NR sequences. The results on the independent test dataset are shown in Table 5. It can be observed that for each subfamily we were able to achieve a prediction accuracy of at least 98.22%, and our overall prediction accuracy is 99.60%. Similarly, we were able to achieve a per-subfamily MCC of at least 0.96, and our MCC for non-NR is 0.96.
TABLE 4
PERFORMANCE OF RF-NR USING LOOCV (LEAVE-ONE-OUT CROSS VALIDATION) ON BENCHMARK DATASET 1
(Accuracy, MCC, Sensitivity and Specificity for NR1-NR6, NR0A, NR0B, NNR and Overall.)

TABLE 5
PERFORMANCE OF RF-NR ON INDEPENDENT DATASET 1
(Accuracy, Precision, Sensitivity, Specificity, F1 and MCC for NR1-NR6, NR0A, NR0B, NNR and Overall.)

3.5 Comparing the RF-NR with Other Methods

In order to assess the comparative performance of RF-NR, we compared RF-NR with various other existing methods. The main reason for having two sets of benchmark datasets is to be able to compare our method with most of these existing methods: NRPred-FS [8], NR-2L [6], iNR-PhysChem [7], and NRfamPred [9]. Since Benchmark Dataset 1 is the dataset on which NRPred-FS was based, we compared our method with NRPred-FS using this dataset. Similarly, Benchmark Dataset 2 (used to build iNR-PhysChem, NR-2L and NRfamPred) was used to compare our method with these methods.

Comparing the RF-NR with NRPred-FS
The results of the independent test samples for our method (RF-NR) and NRPred-FS using Independent Dataset 1 are summarized in Table 6 in the form of a confusion matrix. For each method, the column labeled "correct" shows the number of sequences that were correctly identified, while the column labeled "incorrect/subfamily" shows the number of sequences that were incorrectly predicted together with the subfamily they were incorrectly assigned to. The column ACC denotes the accuracy of each method in percent. It can be seen that for all the NR subfamilies and non-NR proteins, the accuracy of RF-NR is higher than that of the NRPred-FS method.

TABLE 6
COMPARATIVE RESULTS USING INDEPENDENT DATASET 1 FOR RF-NR AND NRPRED-FS
(For each subfamily: the total number of sequences, and, for RF-NR and NRPred-FS, the number of correctly predicted sequences, the incorrectly predicted sequences with the subfamily they were assigned to, and the accuracy (ACC) in percent.)

Furthermore, it can be observed from the table that, for subfamily NR1, the NRPred-FS method correctly predicted only 1034 proteins out of 1090, whereas it incorrectly predicted 2 proteins to be NR2 and 54 proteins as non-NR. In contrast, RF-NR correctly predicted 1053 proteins as NR1 and only 37 proteins were predicted to be non-NR. It is also interesting to note that RF-NR has no trouble distinguishing the subfamilies (except that it sometimes incorrectly assigns them to non-NR), whereas NRPred-FS has trouble distinguishing some subfamilies. As an example, out of 671 NR3 sequences (see Table 6), NRPred-FS could classify only 542 sequences correctly, whereas it incorrectly classified 94 proteins as NR1, 17 proteins as NR2, and 18 as non-NR. The confusion matrix shows clearly that the NRPred-FS model has difficulty in predicting NR3, NR4, and NR5 and in discriminating NRs from non-NRs. Fig. 6 shows the MCC for the various NR subfamilies and non-NR for both RF-NR and NRPred-FS based on Independent Dataset 1. It is clear from Fig. 6 that RF-NR performs better than NRPred-FS. The relatively lower MCC scores of NRPred-FS for NR4 and NR5 are attributed to the fact that this method incorrectly assigned a large number of NR4 and NR5 sequences to other subfamilies. The average MCC for RF-NR and NRPred-FS is 0.98 and 0.87, respectively. Other metrics for the comparison of RF-NR and NRPred-FS are shown in Fig. 7, where MCC is shown as a percentage in order to be on the same scale as the other metrics. It can be observed that, across the various metrics presented, RF-NR shows better performance than NRPred-FS.
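A confusion-matrix summary of the kind reported in Table 6 can be produced from per-sequence predictions as in the following generic Python sketch; the class labels and variables are illustrative, not the exact procedure used to build the published table.

```python
from collections import Counter

def confusion_summary(y_true, y_pred, classes):
    """For each true class, count correct predictions and tally the classes
    that sequences were incorrectly assigned to (as in Table 6)."""
    for c in classes:
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == c]
        correct = sum(p == c for _, p in pairs)
        wrong = Counter(p for _, p in pairs if p != c)
        acc = 100.0 * correct / len(pairs) if pairs else float("nan")
        print(f"{c}: total={len(pairs)} correct={correct} incorrect={dict(wrong)} ACC={acc:.2f}%")

# Toy usage with hypothetical labels
classes = ["NR1", "NR2", "NR3", "NR4", "NR5", "NR6", "NR0A", "NR0B", "NNR"]
confusion_summary(["NR1", "NR1", "NR3", "NNR"], ["NR1", "NNR", "NR3", "NNR"], classes)
```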

Fig. 6. MCC of RF-NR and NRPred-FS on Independent Dataset 1 for the NR subfamilies and non-NR.

Fig. 7. The overall metrics (Accuracy, Precision, Sensitivity, Specificity, F1 score and MCC) of both RF-NR and NRPred-FS on Independent Dataset 1 (MCC in % scale).

As an example of the improvement in performance, we provide, as supplementary documents, the sequences that were classified correctly by RF-NR but misclassified by NRPred-FS (see the supplementary sequence files NR0B_supp and NR1_supp to NR5_supp). The numbers of these sequences are 1, 54, 5, 125, 48, and 80 for the subfamilies NR0B, NR1, NR2, NR3, NR4, and NR5, respectively. As shown by Table 6, we can observe a significant improvement in NR subfamily prediction with RF-NR. This improvement can be explained by the fact that RF relies on a majority decision taken according to strict decision rules based on the information extracted from the sequences, rather than relying on a hyperplane, as in SVM, that separates the data points into categories. Such a strategy renders RF less prone to confusion, as the majority decision assigns the most probable subfamily, whereas the separating hyperplane requires clearly separable data points. Moreover, the spectrum-like features are known for their strength in detecting the homology that characterizes the NR subfamilies. Generally, we can say that both the feature extraction method and the machine learning technique contributed to the improvement in the performance of RF-NR. While the spectrum-like features outline the pattern landscape of the information flow along an individual training sequence and across the entire set of training sequences, exposing the boundaries between closely homologous sets of sequences, the RF selects the most important features that create the most prominent boundaries. As mentioned above and shown in Fig. 2 and Fig. 3, the most important features are found in the most variable regions of the NR sequences rather than in the conserved zones such as the DBD and LBD. To contrast RF with SVM and to support the above hypothesis, the same dataset generated with spectrum-like features and used to train and test RF-NR was also used to train and test an SVM model. The evaluation metrics generated with the independent test sequences are provided as a supplementary document (see supplementary Table A).

Comparing the RF-NR with NRfamPred, iNR-PhysChem and NR-2L
In addition, we also compared the performance of RF-NR to NRfamPred [9], iNR-PhysChem [7], and NR-2L [6] using Benchmark Dataset 2, because Benchmark Dataset 2 and Independent Dataset 2 were used for the development of these methods. The corresponding values of sensitivity and MCC for these methods were obtained from the NRfamPred paper [9]. The comparison of the LOOCV performance of RF-NR, NRfamPred, iNR-PhysChem and NR-2L is shown in Table 7.

TABLE 7
PERFORMANCE COMPARISON OF RF-NR, NRFAMPRED, INR-PHYSCHEM AND NR-2L ON BENCHMARK DATASET 2
(Sensitivity and MCC of each method for the NR subfamilies and Overall.)

It can be observed from Table 7 that RF-NR achieved 95.54% sensitivity and an MCC of 0.94, which is the highest among all the compared methods. It is to be noted here that RF-NR has the highest sensitivity for all subfamilies except subfamily NR1; the same is true for MCC. Furthermore, in order to compare the performance of these methods on a blind dataset, Independent Dataset 2 was utilized.
The results for RF-NR, NRfamPred and NR-2L are shown in Table 8. It can be observed that RF-NR performs as well as these other state-of-the-art methods in a real-life prediction scenario. The added advantage of our method compared to these methods is that we can also gain some insights about the importance of the features.

TABLE 8
PERFORMANCE COMPARISON OF RF-NR, NRFAMPRED AND NR-2L ON INDEPENDENT DATASET 2
(Sensitivity and MCC of each method for the NR subfamilies and Overall.)

IV. CONCLUSION AND DISCUSSION

In this study, we developed a Random Forest based method (RF-NR) to identify NR proteins and subsequently classify them into their respective NR subfamilies. The NR classification problem is posed as a multi-class classification problem and solved using a one-vs-rest strategy. The number of random trees in the forest was set to 200 based on the observed prediction accuracy, and a weighted-class strategy is used to offset the imbalanced class problem. RF-NR is able to predict with near-optimal accuracy whether a query protein sequence belongs to one of the eight subfamilies or to the non-NR group. The new NR classification method uses a combination of discrete amino acid compositions of different settings, which is a parsimonious technique, to extract the pattern information conserved due to the homology of the nuclear receptors. Essentially, the method uses spectrum-like features (k-mer features), namely AAC, DAAC and TAAC. The number of features is considerably high compared to the number of sequences, which may raise some concerns for those who are used to other machine learning algorithms such as support vector machines. On the contrary, RF can handle high dimensionality in an elegant way by adopting a feature selection technique that ensures that only important features with strong prediction capability are selected (see Fig. 2, Fig. 3, and Fig. 4). Another concern is overfitting, in which the model fits the training dataset well but performs poorly on datasets other than the ones used in the study. Most overfitting problems arise because the dataset used for testing was also used for training. The training datasets used in this study were filtered to remove closely similar and redundant sequences, as explained in the dataset section, and the test sequences used for evaluation are sequences that were not included in the training dataset. Furthermore, two sets of datasets with varying redundancy reduction cut-offs (40% and 60%) were utilized to ensure that the high prediction performance is not due to high sequence similarity within the dataset. Moreover, to avoid bias from restricting the evaluation to a subset of sequences, as previous studies did, we exhausted all the remaining sequences in the NucleaRDB database to be sure about the generalization of our model. Furthermore, RF is less prone to overfitting, as demonstrated empirically by Breiman [11]. We also examined the possibility of creating a third dataset with a 30% redundancy reduction cut-off, but we were unable to train the model because of the very small number of samples left in each subfamily of NR proteins. The prediction of the subfamily is conducted by probabilistic averaging of all classifiers, which makes the prediction decision robust and less affected by noise and outliers. Moreover, RF-NR uses the Gini index as a criterion to select the features for node splitting, and it uses the out-of-bag (OOB) error as a built-in error estimate. The method was systematically validated with cross-validation and independent test samples using two sets of datasets with varying sequence redundancy reduction criteria. The performance on the independent datasets and the comparative study between RF-NR and NR-2L, iNR-PhysChem, NRPred-FS and NRfamPred also showed that RF-NR performs equally well as or better than some of the best predictors.
In conclusion, we were able to develop an NR classification method that is accurate, precise, and specific compared to state-of-the-art existing methods. A web site will be developed soon to serve the scientific community.

V. ACKNOWLEDGMENTS

The authors would like to thank the developers of NucleaRDB and Dr. Xiao for providing the dataset and the web server for NRPred-FS. DBKC is partly supported by a startup grant from the Department of Computational Science and Engineering at North Carolina A&T State University. DBKC is also partly supported by the National Science Foundation under Cooperative Agreement No. DBI.

VI. REFERENCES

[1] H. Gronemeyer, J.A. Gustafsson, and V. Laudet, "Principles for modulation of the nuclear receptor superfamily," Nat Rev Drug Discov, 3, 2004.
[2] A.L. Hopkins and C.R. Groom, "The druggable genome," Nat Rev Drug Discov, 1, 2002.
[3] Nuclear Receptors Nomenclature Committee, "A unified nomenclature system for the nuclear receptor superfamily," Cell, 97(2), 1999.
[4] Q.B. Gao, Z.C. Jin, X.F. Ye, C. Wu, and J. He, "Prediction of nuclear receptors with optimal pseudo amino acid composition," Anal Biochem, 387: 54-59.
[5] B. Vroling, D. Thorne, P. McDermott, H.J. Joosten, T.K. Attwood, et al., "NucleaRDB: Information System for Nuclear Receptors," Nucleic Acids Res, 2012.

[6] P. Wang, X. Xiao, and K.C. Chou, "NR-2L: A Two-Level Predictor for Identifying Nuclear Receptor Subfamilies Based on Sequence-Derived Features," PLoS ONE, 6(8), 2011.
[7] X. Xiao, P. Wang, and K.C. Chou, "iNR-PhysChem: A Sequence-Based Predictor for Identifying Nuclear Receptors and Their Subfamilies via Physical-Chemical Property Matrix," PLoS ONE, 7: e30869, 2012.
[8] P. Wang and X. Xiao, "NRPred-FS: A Feature Selection Based Two-Level Predictor for Nuclear Receptors," J Proteomics Bioinform, S9: 002.
[9] R. Kumar, B. Kumari, A. Srivastava, and M. Kumar, "NRfamPred: a proteome-scale two level method for prediction of nuclear receptor proteins and their subfamilies," Sci Rep, 4: 6810, 2014.
[10] C. Leslie, E. Eskin, and W.S. Noble, "The Spectrum Kernel: A String Kernel for SVM Protein Classification," Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 2002.
[11] L. Breiman, "Random Forests," Machine Learning, 45(1): 5-32, 2001.
[12] V. Laudet, "Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor," J. Mol. Endocrinol., 19(3), 1997.
[13] H. Ying, B. Niu, Y. Gao, L. Fu, and W. Li, "CD-HIT Suite: a web server for clustering and comparing biological sequences," Bioinformatics, 26, 2010.
[14] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler, "GenBank," Nucleic Acids Research, 33: D34-D38, 2005.
[15] K.C. Chou, "Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology," Curr Proteomics, 6.
[16] K. Nishikawa, Y. Kubota, and T. Ooi, "Classification of Proteins into Groups Based on Amino Acid Composition and Other Characters," J Biochem, 94.
[17] D.S. Cao, Q.S. Xu, and Y.Z. Liang, "propy: a tool to generate various modes of Chou's PseAAC," Bioinformatics, 29, 2013.
[18] T. Jo and J. Cheng, "Improving Fold Recognition by Random Forest," BMC Bioinformatics, 15(Suppl 11): S14, 2014.
[19] F. Pedregosa, G. Varoquaux, ..., E. Duchesnay, "Scikit-learn: Machine Learning in Python," JMLR, 12: 2825-2830, 2011.

Hamid D. Ismail received a DVM and a Higher Diploma in Computer Programming and Statistics from the University of Khartoum, and an M.Sc. in computational science and engineering from NC A&T State University. He is a SAS Certified Advanced Programmer and a Certified Oracle SQL Expert. He has worked as a statistician and software developer with several companies. He is currently working as a research associate at NC A&T State University and pursuing a doctoral degree in computational science and engineering at NC A&T State University.

Hiroto Saigo received the BS degree in electrical and electronics engineering from Sophia University. He received his Master's and Doctor's degrees, both in informatics, from Kyoto University in 2003 and 2006, respectively. He was also a visiting student at the University of California, Irvine (UCI). He worked as a research scientist at the Max Planck Institute (MPI) for Biological Cybernetics and at the MPI for Informatics from 2008 to 2010. Since 2010, he has been an associate professor at Kyushu Institute of Technology. His research focuses on the development of statistical machine learning methods tailored for problems in bioinformatics and cheminformatics. He serves as a reviewer/PC member for numerous top-level conferences and journals.

Dukka B KC received a B.E. in Computer Science, an M.Inf. in Bioinformatics, and a PhD in Bioinformatics from Kyoto University in 2001, 2003 and 2006, respectively. From 2006 to 2007, he was a postdoctoral fellow at Georgia Institute of Technology.
He was subsequently a postdoctoral fellow at the University of North Carolina at Charlotte, and then a CRTA fellow at the National Cancer Institute and a Bioinformatics Scientist at the Center for Information Technology at the National Institutes of Health. Currently, he is an assistant professor at North Carolina A&T State University. His research interests are in developing algorithms for deciphering the sequence/structure/function evolution relationships of proteins.


Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

A model for the evaluation of domain based classification of GPCR

A model for the evaluation of domain based classification of GPCR 4(4): 138-142 (2009) 138 A model for the evaluation of domain based classification of GPCR Tannu Kumari *, Bhaskar Pant, Kamalraj Raj Pardasani Department of Mathematics, MANIT, Bhopal - 462051, India;

More information

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee Support Vector Machine Industrial AI Lab. Prof. Seungchul Lee Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories /

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information

Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins

Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins Mol Divers (2008) 12:41 45 DOI 10.1007/s11030-008-9073-0 FULL LENGTH PAPER Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins Bing Niu Yu-Huan Jin Kai-Yan

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Random Forests: Finding Quasars

Random Forests: Finding Quasars This is page i Printer: Opaque this Random Forests: Finding Quasars Leo Breiman Michael Last John Rice Department of Statistics University of California, Berkeley 0.1 Introduction The automatic classification

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Supplementary Information

Supplementary Information Supplementary Information Performance measures A binary classifier, such as SVM, assigns to predicted binding sequences the positive class label (+1) and to sequences predicted as non-binding the negative

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last

More information

Machine Learning on temporal data

Machine Learning on temporal data Machine Learning on temporal data Classification rees for ime Series Ahlame Douzal (Ahlame.Douzal@imag.fr) AMA, LIG, Université Joseph Fourier Master 2R - MOSIG (2011) Plan ime Series classification approaches

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Supplementary Materials for R3P-Loc Web-server

Supplementary Materials for R3P-Loc Web-server Supplementary Materials for R3P-Loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to R3P-Loc Server Contents 1 Introduction to R3P-Loc

More information

Bayesian Decision Theory

Bayesian Decision Theory Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Bayesian classifier is

More information

Predicting flight on-time performance

Predicting flight on-time performance 1 Predicting flight on-time performance Arjun Mathur, Aaron Nagao, Kenny Ng I. INTRODUCTION Time is money, and delayed flights are a frequent cause of frustration for both travellers and airline companies.

More information

CS 188: Artificial Intelligence. Outline

CS 188: Artificial Intelligence. Outline CS 188: Artificial Intelligence Lecture 21: Perceptrons Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein. Outline Generative vs. Discriminative Binary Linear Classifiers Perceptron Multi-class

More information

Random Forests for Ordinal Response Data: Prediction and Variable Selection

Random Forests for Ordinal Response Data: Prediction and Variable Selection Silke Janitza, Gerhard Tutz, Anne-Laure Boulesteix Random Forests for Ordinal Response Data: Prediction and Variable Selection Technical Report Number 174, 2014 Department of Statistics University of Munich

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Decision Support. Dr. Johan Hagelbäck.

Decision Support. Dr. Johan Hagelbäck. Decision Support Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Decision Support One of the earliest AI problems was decision support The first solution to this problem was expert systems

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Meta-Learning for Escherichia Coli Bacteria Patterns Classification

Meta-Learning for Escherichia Coli Bacteria Patterns Classification Meta-Learning for Escherichia Coli Bacteria Patterns Classification Hafida Bouziane, Belhadri Messabih, and Abdallah Chouarfia MB University, BP 1505 El M Naouer 3100 Oran Algeria e-mail: (h_bouziane,messabih,chouarfia)@univ-usto.dz

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18 Decision Tree Analysis for Classification Problems Entscheidungsunterstützungssysteme SS 18 Supervised segmentation An intuitive way of thinking about extracting patterns from data in a supervised manner

More information

Machine Learning in Action

Machine Learning in Action Machine Learning in Action Tatyana Goldberg (goldberg@rostlab.org) August 16, 2016 @ Machine Learning in Biology Beijing Genomics Institute in Shenzhen, China June 2014 GenBank 1 173,353,076 DNA sequences

More information

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Inferring Transcriptional Regulatory Networks from Gene Expression Data II Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Hypothesis Evaluation

Hypothesis Evaluation Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall 1395 1 / 31 Table of contents 1 Introduction

More information

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery

A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery AtomNet A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery Izhar Wallach, Michael Dzamba, Abraham Heifets Victor Storchan, Institute for Computational and

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1

IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 1 1 2 IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS - PART B 2 An experimental bias variance analysis of SVM ensembles based on resampling

More information