Evaluation of the relative contribution of each STRING feature in the overall accuracy of operon classification
B. Taboada 1,*, E. Merino 2, C. Verde 3
* blanca.taboada@ccadet.unam.mx
1 Centro de Ciencias Aplicadas y Desarrollo Tecnológico, UNAM, Apdo. Postal 7-86, México, D.F., 45; 2 Instituto de Biotecnología, UNAM, Apdo. Postal 5-3, Cuernavaca, Morelos, 6225; 3 Instituto de Ingeniería, UNAM, Apdo. Postal 7-472, México D.F., 45

Abstract
Due to the biological relevance of operons in coordinating the expression of metabolically or functionally related genes in bacterial organisms, different computational methods have been devised for classifying them in the fast-growing set of fully sequenced genomes. As far as we know, the best predictive accuracies previously obtained for the model organisms Escherichia coli and Bacillus subtilis, trained with their corresponding known operon data sets, were 93% and 9%, respectively. In a previous work, we presented a simple and highly accurate classification method for operon prediction, based on intergenic distances and on the functional relationships between contiguous genes as defined by the STRING database, whose scores are evaluated from the weighted values coming from different kinds of sources. These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons, with accuracies of 94.6% and 93.3%, respectively. As far as we know, these were the highest accuracies ever obtained for predicting bacterial operons. In this work, we evaluate the relative contribution of each STRING feature to the overall accuracy of operon classification. Moreover, we repeated the operon classification analysis considering the intergenic distances and the individual STRING features as input data, obtaining a better classification.
1. Introduction
Operons can be defined as a gene or set of genes arranged contiguously on the same transcriptional strand of a genome sequence, which are co-transcribed in the same transcription unit (TU). Due to the biological relevance of operons in coordinating the expression of metabolically or functionally related genes in bacterial organisms, different computational protocols have been devised for identifying them [1-4]. Some of the most important genome characteristics that have been considered are: (i) Transcription direction of the genes: this is a straightforward way of identifying the boundaries of certain operons, as genes on opposite strands always form part of different operons. (ii) Intergenic distances: the intergenic distances between contiguous genes of the same operon are generally shorter than the distances between contiguous genes of different operons [1-4]. (iii) Gene expression patterns: genes from the same operon tend to have highly correlated expression values [2]; unfortunately, gene expression data are available for only a few organisms. (iv) Functional relationships between the proteins encoded in an operon, as these genes commonly share similar or closely related functions [2,3]. (v) Conserved metabolic pathways encoded by the genes of the operon [2]. (vi) Conserved gene neighborhood, implying a tendency of the genes in an operon to be preserved across phylogenetically related organisms [2-4]. (vii) Phylogenetic profiles, indicating a general trend for a set of genes to be simultaneously present or absent in closely related organisms [2-4]. Despite extensive work employing different computational approaches and genomic characteristics of operons, the best classification accuracies obtained for the model organisms Escherichia coli and Bacillus subtilis, trained with their corresponding known operon data sets, were 93 and 9%, respectively [4]. As expected, these accuracy values decreased significantly, from 10 to 30%, when
training and testing data sets did not correspond to the same organism.

In a previous work [5], we presented a simple and highly accurate operon classification method based on intergenic distances and on the functional relationships between contiguous genes as defined by the STRING database [6], whose scores are evaluated from the weighted values coming from seven different kinds of sources. These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons, with accuracies of 94.6% and 93.3%, respectively. Moreover, the accuracy reduction was only 1.3% when the training and testing data sets were not from the same organism. As far as we know, these are the highest accuracies ever obtained for bacterial operon classification. In this work, we repeated the operon classification analysis considering as input data the intergenic distances and those STRING features that we considered relevant for the operon prediction analyses, instead of the integrated STRING scores. The relative contribution of each STRING feature to the overall accuracy was thereby evaluated, in order to identify non-informative or redundant features, which provide little additional class-discriminatory information while increasing computational time and classifier complexity.

2. METHODS
Let 𝒢 = {G_1, …, G_n} be the set of n bacterial genomes, where each genome G is a set of m ordered genes, G = {g_1, …, g_m}. Each contiguous gene pair (g_i, g_{i+1}) is characterized by a set of attributes Y_{i,i+1} = (y_1(g_i, g_{i+1}), …, y_q(g_i, g_{i+1})), where each attribute y_j(g_i, g_{i+1}) has its own domain D_j, which can be binary, integer, string, among others. In this sense, y_j(g_i, g_{i+1}) denotes the attribute y_j ∈ D_j of the genes g_i and g_{i+1} of genome G, so that each gene pair corresponds to a point in the q-dimensional space R^q. As mentioned in the Introduction, contiguous genes that are functionally related are co-transcribed in the same unit, called an operon.
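The formal setup above — each contiguous gene pair (g_i, g_{i+1}) mapped to a q-dimensional attribute vector with a binary operon label — can be sketched in Python. This is an illustrative sketch only; the feature names and values below are hypothetical placeholders, not the authors' data:

```python
# Hypothetical names standing in for the q = 8 attributes y_1..y_8
# (intergenic distance plus the seven individual STRING features).
FEATURES = [
    "intergenic_distance",   # y1, base pairs between g_i and g_{i+1}
    "gene_neighborhood",     # y2
    "gene_fusion",           # y3
    "gene_cooccurrence",     # y4
    "gene_coexpression",     # y5
    "experimental_ppi",      # y6
    "other_databases",       # y7
    "literature_mining",     # y8
]

def pair_vector(attributes):
    """Map a contiguous gene pair's attributes to a point in R^q.

    Missing attributes default to 0.0, i.e. no evidence recorded.
    """
    return [float(attributes.get(name, 0.0)) for name in FEATURES]

# A made-up operonic pair: short intergenic distance, strong STRING evidence.
x = pair_vector({"intergenic_distance": 12, "gene_neighborhood": 0.9})
label = 1  # class O (same operon); 0 would denote class non-O
```

Each pair vector, together with its 0/1 label, is then one training example for the classifier described below.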
Thus, the operon classification method must associate each contiguous gene pair with one class label from Ω = {O, non-O}: O when the genes belong to the same operon, and non-O in the contrary case. In other words, the task is to learn the mapping Y_{i,i+1}: R^q → Ω, evaluating the contribution of each y_j(g_i, g_{i+1}), j = 1, …, q, to the classification process.

2.1 Data set
In this work, we used the same E. coli data set as in our previous work, restricted to gene pairs for which there is a STRING score associated to a COG group [6]: 435 operonic gene pairs (contiguous genes of the same operon) and 39 non-operonic gene pairs (5′ and 3′ operon gene borders and their corresponding upstream and downstream adjacent genes transcribed in the same direction).

2.2 Features
Previously, the intergenic distances and the integrated STRING score were used to classify operons in bacterial genomes [5]. In this work, the intergenic distances and the individual STRING values coming from seven different kinds of sources are used instead, in order to determine the most informative STRING features for operon classification.

Intergenic distance (y_1(g_i, g_{i+1})): in accordance with [1], the intergenic distances of operonic gene pairs in E. coli tend to be shorter than the intergenic distances between non-operonic gene pairs.

Gene neighborhood (y_2(g_i, g_{i+1})): implies a tendency of operonic genes to be preserved across phylogenetically related organisms [2-4], which is a good indicator of functional linkage.

Gene fusion (y_3(g_i, g_{i+1})): genes joined to encode a single fusion protein, which is indicative of functional linkage even in organisms where the two proteins have not fused.
Gene co-occurrence (y_4(g_i, g_{i+1})): indicates a general trend for a set of operonic genes to be simultaneously present in closely related organisms [2-4]; this, again, predicts that they contribute to similar functional processes in the cell.

Gene co-expression (y_5(g_i, g_{i+1})): operonic genes display a similar transcriptional response across a variety of conditions [2].

Experimentally derived protein-protein interactions (y_6(g_i, g_{i+1})).

Information coming from other databases (y_7(g_i, g_{i+1})): protein association knowledge from databases of curated biological pathway knowledge.

Automatic literature mining (y_8(g_i, g_{i+1})): co-mentioned genes are identified, which may imply a functional relationship between them.

2.3 Contributions of the different features
In order to evaluate the relative contribution of the features y_j(g_i, g_{i+1}), j = 1, …, 8, to the overall accuracy of the operon classification method, and to select the most informative ones, a multilayer perceptron artificial neural network (NN) was implemented to minimize the error between the desired and the predicted outputs. The design of the NN involved three main steps: (i) Input data pre-processing, carried out by normalizing all input features to the range of the activation function (hyperbolic tangent) of the hidden neurons, [-1, 1], in order to avoid an exponential calculation overflow and to ensure that the range of each feature does not influence the performance of the NN. (ii) Selection of an appropriate network architecture, by testing different topology configurations, varying the number of layers and the number of neurons in each layer; the network used consisted of three layers: one input layer of eight neurons, one hidden layer of eleven neurons (the number that gave the best prediction results) and one output layer of one neuron. The desired outputs have values of either 1, for gene pairs that belong to the same operon, or 0, for gene pairs that do not belong to the same operon.
(iii) Selection of the training algorithm; the quick propagation algorithm was used. The conventional one-training-and-one-testing validation was performed to obtain the accuracy of the NN, randomly dividing the input data into 80% used for training and 10% for testing.

The contribution of y_j(g_i, g_{i+1}), j = 1, …, 8, was determined by partitioning the hidden-output connection weights of each hidden neuron into components associated with each input neuron [7], as follows:

a) For each hidden neuron h, divide the absolute value of the input-hidden connection weight by the sum of the absolute values of the input-hidden connection weights over all input neurons:

For h = 1 to nh:
    For i = 1 to ni:
        Q_ih = |W_ih| / Σ_{i=1..ni} |W_ih|

where nh is the number of neurons in the hidden layer (here nh = 11), ni is the number of neurons in the input layer (ni = 8) and W_ih is the connection weight between input neuron i and hidden neuron h.

b) For each input neuron i, divide the sum of Q_ih over the hidden neurons by the sum of Q_ih over all hidden and input neurons, and multiply by 100. The relative importance of all output weights attributable to the given input variable is then obtained:

For i = 1 to ni:
    RI_i(%) = 100 × ( Σ_{h=1..nh} Q_ih ) / ( Σ_{h=1..nh} Σ_{i=1..ni} Q_ih )
End

2.4 Performance measurement
As previously undertaken in our operon classification study [5], the efficiency of operon classification was calculated as follows:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = (TP + TN) / (TP + FP + TN + FN)

where TP (true positives) are the operonic gene pairs correctly predicted among known operonic pairs; FN (false negatives) are the operonic gene pairs incorrectly predicted as non-operonic; TN (true negatives) are the correctly predicted non-operonic gene pairs among known non-operonic pairs; and FP (false positives) are the non-operonic gene pairs incorrectly predicted as operonic.

3. RESULTS
The accuracy obtained by this new operon classification method, including all individual STRING features, was slightly better (96.6% versus 95.%) than that obtained in our previous work [5]. The relative contribution of each feature was as follows: intergenic distance, 25.%; gene neighborhood, 32.6%; gene fusion, .5%; gene co-occurrence, 2.3%; gene co-expression, 7.7%; experimentally derived protein-protein interactions, .8%; information coming from other databases, 9.8%; and automatic literature mining, 2.2%. These results show the importance of each variable in discriminating the operon and non-operon classes. This is also shown in Figure 1, where it can be seen that gene fusion (Figure 1C) and experimental information (Figure 1F) are the features that show the smallest differences between the data of E. coli operonic and non-operonic pairs. This is essentially due to the amount of information for these variables in the STRING DB (Table 1), since some of them are better represented (have more records) than others.

Features                    Records number    Representation in STRING DB
STRING weighted scores      2,5,886
Gene neighborhood           2,865,            %
Gene fusion                 3,446             .%
Gene co-occurrence          6,59,             %
Gene co-expression          965,              %
Experimental inf.           473,              %
Inf. other DB               3,                %
Literature mining           2,223,96          8.5%

Table 1.
STRING features' representation in relation to the total size (records) of the DB.

Subsequently, the analysis of operon classification was repeated, this time using only the features that contribute most as input; gene fusion, gene co-occurrence and experimental information were not considered. A three-layer NN with a five-nine-one-neuron architecture was used. Interestingly, the accuracy obtained by this new NN (95.8% versus 96.6%) was only slightly worse than that obtained using all the STRING features, while the computational time and classifier complexity were reduced. Further work on this topic will validate this result by 10-fold cross-validation, to estimate how well the method generalizes, and apply the method to other genomes to evaluate its efficiency.
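The two computations described in the Methods — the weight-partition procedure of [7] for feature contributions and the sensitivity/specificity/precision measures — reduce to short routines. The sketch below is illustrative only: the random 8×11 weight matrix merely stands in for a trained network's input-to-hidden weights, and the confusion-matrix counts are made up.

```python
import random

def garson_importance(W):
    """Relative importance (%) of each input, from input-to-hidden weights.

    W[i][h] is the connection weight between input neuron i and hidden
    neuron h (in the paper, ni = 8 inputs and nh = 11 hidden neurons).
    """
    ni, nh = len(W), len(W[0])
    # Step (a): within each hidden neuron, each input's share of |weight|.
    col = [sum(abs(W[i][h]) for i in range(ni)) for h in range(nh)]
    Q = [[abs(W[i][h]) / col[h] for h in range(nh)] for i in range(ni)]
    # Step (b): sum over hidden neurons, normalise, multiply by 100.
    total = sum(sum(row) for row in Q)
    return [100.0 * sum(Q[i]) / total for i in range(ni)]

def performance(tp, fn, tn, fp):
    """Sensitivity, specificity and precision as defined in the Methods."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": (tp + tn) / (tp + fp + tn + fn),
    }

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(11)] for _ in range(8)]
ri = garson_importance(W)
assert abs(sum(ri) - 100.0) < 1e-9  # contributions sum to 100%
```

Note that because each column of Q sums to 1, the denominator equals nh, so the eight relative importances always add up to exactly 100%.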
ACKNOWLEDGMENT
This work was supported by CONACyT (grants 627-Q and SALUD-7-C-68992) and DGAPA (IN2278) to E.M.

Figure 1. Frequency distribution of intergenic distances and STRING features of E. coli operonic and non-operonic gene pairs (panels A-H: relative frequency (%) of intergenic distances (bp), gene neighborhood, gene fusion, gene co-occurrence, gene co-expression, experimental information, information from other DBs and literature mining).

REFERENCES
1. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97, 6652-6657.
2. Okuda,S., Kawashima,S., Kobayashi,K., Ogasawara,N., Kanehisa,M. and Goto,S. (2007) Characterization of relationships between transcriptional units and operon structures in Bacillus subtilis and Escherichia coli. BMC Genomics, 8, 48.
3. Romero,P.R. and Karp,P.D. (2004) Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics, 20, 709-717.
4. Dam,P., Olman,V., Harris,K., Su,Z. and Xu,Y. (2007) Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res., 35, 288-298.
5. Taboada,B., Verde,C. and Merino,E. (2010) High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res., published online.
6. Jensen,L.J., Kuhn,M., Stark,M., Chaffron,S., Creevey,C., Muller,J., Doerks,T., Julien,P., Roth,A., Simonovic,M. et al. (2009) STRING 8 - a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37, D412-D416.
7. Gevrey,M., Dimopoulos,I. and Lek,S. (2003) Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249-264.
+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions
More informationTMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg
title: short title: TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg lecture: Protein Prediction 1 (for Computational Biology) Protein structure TUM summer semester 09.06.2016 1 Last time 2 3 Yet another
More informationMutual Information & Genotype-Phenotype Association. Norman MacDonald January 31, 2011 CSCI 4181/6802
Mutual Information & Genotype-Phenotype Association Norman MacDonald January 31, 2011 CSCI 4181/6802 2 Overview What is information (specifically Shannon Information)? What are information entropy and
More informationProteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?
Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains
More informationCOMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization
: Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage
More informationIntegration of Omics Data to Investigate Common Intervals
2011 International Conference on Bioscience, Biochemistry and Bioinformatics IPCBEE vol.5 (2011) (2011) IACSIT Press, Singapore Integration of Omics Data to Investigate Common Intervals Sébastien Angibaud,
More informationA Novel Prediction Method of Protein Structural Classes Based on Protein Super-Secondary Structure
Journal of Computer and Communications, 2016, 4, 54-62 http://www.scirp.org/journal/jcc ISSN Online: 2327-5227 ISSN Print: 2327-5219 A Novel Prediction Method of Protein Structural Classes Based on Protein
More informationTaxonomy. Content. How to determine & classify a species. Phylogeny and evolution
Taxonomy Content Why Taxonomy? How to determine & classify a species Domains versus Kingdoms Phylogeny and evolution Why Taxonomy? Classification Arrangement in groups or taxa (taxon = group) Nomenclature
More informationBacillus anthracis. Last Lecture: 1. Introduction 2. History 3. Koch s Postulates. 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes
Last Lecture: Bacillus anthracis 1. Introduction 2. History 3. Koch s Postulates Today s Lecture: 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes 3. Phylogenetics I. Basic Cell structure: (Fig.
More informationConservation of Gene Co-Regulation between Two Prokaryotes: Bacillus subtilis and Escherichia coli
116 Genome Informatics 16(1): 116 124 (2005) Conservation of Gene Co-Regulation between Two Prokaryotes: Bacillus subtilis and Escherichia coli Shujiro Okuda 1 Shuichi Kawashima 2 okuda@kuicr.kyoto-u.ac.jp
More informationStatistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department
More informationBiology 105/Summer Bacterial Genetics 8/12/ Bacterial Genomes p Gene Transfer Mechanisms in Bacteria p.
READING: 14.2 Bacterial Genomes p. 481 14.3 Gene Transfer Mechanisms in Bacteria p. 486 Suggested Problems: 1, 7, 13, 14, 15, 20, 22 BACTERIAL GENETICS AND GENOMICS We still consider the E. coli genome
More informationPattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The
More informationA genomic-scale search for regulatory binding sites in the integration host factor regulon of Escherichia coli K12
The integration host factor regulon of E. coli K12 genome 783 A genomic-scale search for regulatory binding sites in the integration host factor regulon of Escherichia coli K12 M. Trindade dos Santos and
More informationBioinformatics 2. Yeast two hybrid. Proteomics. Proteomics
GENOME Bioinformatics 2 Proteomics protein-gene PROTEOME protein-protein METABOLISM Slide from http://www.nd.edu/~networks/ Citrate Cycle Bio-chemical reactions What is it? Proteomics Reveal protein Protein
More informationData Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td
Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak
More informationSUPPLEMENTARY MATERIALS
SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationDynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data Vadim Ayuyev, Joseph Jupin, Philip Harris and Zoran Obradovic Temple University, Philadelphia, USA 2009 Real Life Data is Often
More informationIntelligent Handwritten Digit Recognition using Artificial Neural Network
RESEARCH ARTICLE OPEN ACCESS Intelligent Handwritten Digit Recognition using Artificial Neural Networ Saeed AL-Mansoori Applications Development and Analysis Center (ADAC), Mohammed Bin Rashid Space Center
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationGenetic Variation: The genetic substrate for natural selection. Horizontal Gene Transfer. General Principles 10/2/17.
Genetic Variation: The genetic substrate for natural selection What about organisms that do not have sexual reproduction? Horizontal Gene Transfer Dr. Carol E. Lee, University of Wisconsin In prokaryotes:
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationLecture 16: Introduction to Neural Networks
Lecture 16: Introduction to Neural Networs Instructor: Aditya Bhasara Scribe: Philippe David CS 5966/6966: Theory of Machine Learning March 20 th, 2017 Abstract In this lecture, we consider Bacpropagation,
More informationCourse 395: Machine Learning - Lectures
Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture
More information#33 - Genomics 11/09/07
BCB 444/544 Required Reading (before lecture) Lecture 33 Mon Nov 5 - Lecture 31 Phylogenetics Parsimony and ML Chp 11 - pp 142 169 Genomics Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33
More informationIntroduction to Bioinformatics Integrated Science, 11/9/05
1 Introduction to Bioinformatics Integrated Science, 11/9/05 Morris Levy Biological Sciences Research: Evolutionary Ecology, Plant- Fungal Pathogen Interactions Coordinator: BIOL 495S/CS490B/STAT490B Introduction
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationProtein-Protein Interaction Classification Using Jordan Recurrent Neural Network
Protein-Protein Interaction Classification Using Jordan Recurrent Neural Network Dilpreet Kaur Department of Computer Science and Engineering PEC University of Technology Chandigarh, India dilpreet.kaur88@gmail.com
More information18.6 Regression and Classification with Linear Models
18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationECLT 5810 Classification Neural Networks. Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann
ECLT 5810 Classification Neural Networks Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann Neural Networks A neural network is a set of connected input/output
More information