Evaluation of the relative contribution of each STRING feature in the overall accuracy of operon classification
B. Taboada 1,*, E. Merino 2, C. Verde 3
* blanca.taboada@ccadet.unam.mx
1 Centro de Ciencias Aplicadas y Desarrollo Tecnológico, UNAM, Apdo. Postal 7-86, México, D.F., 45; 2 Instituto de Biotecnología, UNAM, Apdo. Postal 5-3, Cuernavaca, Morelos, 6225; 3 Instituto de Ingeniería, UNAM, Apdo. Postal 7-472, México D.F., 45

Abstract
Due to the biological relevance of operons in coordinating the expression of metabolically or functionally related genes in bacterial organisms, different computational methods have been devised for classifying them in the fast-growing set of fully sequenced genomes. As far as we know, the best predictive accuracies previously obtained for the model organisms Escherichia coli and Bacillus subtilis, trained with their corresponding known operon data sets, were 93% and 9%, respectively. In a previous work, we presented a simple and highly accurate classification method for operon prediction, based on intergenic distances and on the functional relationships between contiguous genes as defined by the STRING database, whose scores are evaluated from the weighted values coming from different kinds of sources. These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons, with accuracies of 94.6% and 93.3%, respectively. As far as we know, these were the highest accuracies ever obtained for predicting bacterial operons. In this work, we evaluate the relative contribution of each STRING feature to the overall accuracy of operon classification. Moreover, we repeated the operon classification analysis considering the intergenic distances and the individual STRING features as input data, obtaining a better classification.
1. Introduction
Operons can be defined as a gene or set of genes arranged contiguously on the same transcriptional strand of a genome sequence, which are co-transcribed in the same transcription unit (TU). Due to the biological relevance of operons in coordinating the expression of metabolically or functionally related genes in bacterial organisms, different computational protocols have been devised for identifying them [1-4]. Some of the most important genome characteristics that have been considered are: (i) Transcription direction of the genes: this is a straightforward way of identifying the boundaries of certain operons, as genes on opposite strands always form part of different operons. (ii) Intergenic distances: the intergenic distances between contiguous genes of the same operon are generally shorter than the distances between contiguous genes of different operons [1-4]. (iii) Gene expression patterns: genes from the same operon tend to have highly correlated expression values [2]; unfortunately, gene expression data are available for only a few organisms. (iv) Functional relationships between the proteins encoded in an operon, as these genes commonly share similar or closely related functions [2,3]. (v) Conserved metabolic pathways encoded by the genes of the operon [2]. (vi) Conserved gene neighborhood, implying a tendency of the genes in an operon to be preserved across phylogenetically related organisms [2-4]. (vii) Phylogenetic profiles, indicating a general trend for a set of genes to be simultaneously present or absent in closely related organisms [2-4]. Despite extensive work employing different computational approaches and genomic characteristics of operons, the best classification accuracies obtained for the model organisms Escherichia coli and Bacillus subtilis, trained with their corresponding known operon data sets, were 93 and 9%, respectively [4]. As expected, these accuracy values decreased significantly, from 10 to 30%, when
training and testing data sets did not correspond to the same organism.

In a previous work [5], we presented a simple and highly accurate operon classification method based on intergenic distances and on the functional relationships between contiguous genes as defined by the STRING database [6], whose scores are evaluated from the weighted values coming from seven different kinds of sources. These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons, with accuracies of 94.6% and 93.3%, respectively. Moreover, the accuracy reduction was only 1.3% when the training and testing data sets were not from the same organism. As far as we know, these are the highest accuracies ever obtained for bacterial operon classification. In this work, we repeated the operon classification analysis considering as input data the intergenic distances and those STRING features that we considered relevant for the operon prediction analyses, instead of the integrated STRING scores. The relative contribution of each STRING feature to the overall accuracy was thereby evaluated, in order to identify non-informative or redundant features, which provide little additional class-discriminatory information while increasing computational time and classifier complexity.

2. METHODS
Let 𝒢 = {G_1, …, G_n} be the set of n bacterial genomes, where each genome G is a set of m ordered genes, G = {g_1, …, g_m}. Each contiguous gene pair (g_i, g_{i+1}) is characterized by a set of attributes Y_{i,i+1} = (y_1(g_i, g_{i+1}), …, y_q(g_i, g_{i+1})), where each attribute y_j(g_i, g_{i+1}) has its own domain D_j, which can be binary, integer, string, among others. In this sense, y_j(g_i, g_{i+1}) denotes the attribute y_j ∈ D_j of the genes g_i and g_{i+1} of genome G, so that each gene pair corresponds to a point in the q-dimensional space R^q. As mentioned in the Introduction, contiguous genes that are functionally related are co-transcribed in the same unit, called an operon.
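The formal setup above — each contiguous gene pair (g_i, g_{i+1}) mapped to a q-dimensional attribute vector with a binary operon label — can be sketched in Python. This is an illustrative sketch only; the feature names and values below are hypothetical placeholders, not the authors' data:

```python
# Hypothetical names standing in for the q = 8 attributes y_1..y_8
# (intergenic distance plus the seven individual STRING features).
FEATURES = [
    "intergenic_distance",   # y1, base pairs between g_i and g_{i+1}
    "gene_neighborhood",     # y2
    "gene_fusion",           # y3
    "gene_cooccurrence",     # y4
    "gene_coexpression",     # y5
    "experimental_ppi",      # y6
    "other_databases",       # y7
    "literature_mining",     # y8
]

def pair_vector(attributes):
    """Map a contiguous gene pair's attributes to a point in R^q.

    Missing attributes default to 0.0, i.e. no evidence recorded.
    """
    return [float(attributes.get(name, 0.0)) for name in FEATURES]

# A made-up operonic pair: short intergenic distance, strong STRING evidence.
x = pair_vector({"intergenic_distance": 12, "gene_neighborhood": 0.9})
label = 1  # class O (same operon); 0 would denote class non-O
```

Each pair vector, together with its 0/1 label, is then one training example for the classifier described below.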
Thus, the operon classification method must associate each contiguous gene pair with one class label from Ω = {O, non-O}: O when the genes belong to the same operon, and non-O in the contrary case. In other words, the task is to learn the mapping Y_{i,i+1}: R^q → Ω, evaluating the contribution of each y_j(g_i, g_{i+1}), j = 1, …, q, to the classification process.

2.1 Data set
In this work, we used the same E. coli data set as in our previous work, restricted to gene pairs for which there is a STRING score associated to a COG group [6]: 435 operonic gene pairs (contiguous genes of the same operon) and 39 non-operonic gene pairs (5′ and 3′ operon gene borders and their corresponding upstream and downstream adjacent genes transcribed in the same direction).

2.2 Features
Previously, the intergenic distances and the integrated STRING score were used to classify operons in bacterial genomes [5]. In this work, the intergenic distances and the individual STRING values coming from seven different kinds of sources are used instead, in order to determine the most informative STRING features for operon classification.

Intergenic distance (y_1(g_i, g_{i+1})): in accordance with [1], the intergenic distances of operonic gene pairs in E. coli tend to be shorter than the intergenic distances between non-operonic gene pairs.

Gene neighborhood (y_2(g_i, g_{i+1})): implies a tendency of operonic genes to be preserved across phylogenetically related organisms [2-4], which is a good indicator of functional linkage.

Gene fusion (y_3(g_i, g_{i+1})): genes joined to encode a single fusion protein, which is indicative of functional linkage even in organisms where the two proteins have not fused.
Gene co-occurrence (y_4(g_i, g_{i+1})): indicates a general trend for a set of operonic genes to be simultaneously present in closely related organisms [2-4]; this, again, predicts that they contribute to similar functional processes in the cell.

Gene co-expression (y_5(g_i, g_{i+1})): operonic genes display a similar transcriptional response across a variety of conditions [2].

Experimentally derived protein-protein interactions (y_6(g_i, g_{i+1})).

Information coming from other databases (y_7(g_i, g_{i+1})): protein association knowledge from databases of curated biological pathway knowledge.

Automatic literature mining (y_8(g_i, g_{i+1})): co-mentioned genes are identified, which may imply a functional relationship between them.

2.3 Contributions of the different features
In order to evaluate the relative contribution of the features y_j(g_i, g_{i+1}), j = 1, …, 8, to the overall accuracy of the operon classification method, and to select the most informative ones, a multilayer perceptron artificial neural network (NN) was implemented to minimize the error between the desired and the predicted outputs. The design of the NN involved three main steps: (i) Input data pre-processing, carried out by normalizing all input features to the range of the activation function (hyperbolic tangent) of the hidden neurons, [-1, 1], in order to avoid an exponential calculation overflow and to ensure that the range of each feature does not influence the performance of the NN. (ii) Selection of an appropriate network architecture, by testing different topology configurations, varying the number of layers and the number of neurons in each layer; the network used consisted of three layers: one input layer of eight neurons, one hidden layer of eleven neurons (the number that gave the best prediction results) and one output layer of one neuron. The desired outputs have values of either 1, for gene pairs that belong to the same operon, or 0, for gene pairs that do not belong to the same operon.
(iii) Selection of the training algorithm; the quick propagation algorithm was used. The conventional one-training-and-one-testing validation was performed to obtain the accuracy of the NN, randomly dividing the input data into 80% used for training and 10% for testing.

The contribution of y_j(g_i, g_{i+1}), j = 1, …, 8, was determined by partitioning the hidden-output connection weights of each hidden neuron into components associated with each input neuron [7], as follows:

a) For each hidden neuron h, divide the absolute value of the input-hidden connection weight by the sum of the absolute values of the input-hidden connection weights over all input neurons:

For h = 1 to nh:
    For i = 1 to ni:
        Q_ih = |W_ih| / Σ_{i=1..ni} |W_ih|

where nh is the number of neurons in the hidden layer (here nh = 11), ni is the number of neurons in the input layer (ni = 8) and W_ih is the connection weight between input neuron i and hidden neuron h.

b) For each input neuron i, divide the sum of Q_ih over the hidden neurons by the sum of Q_ih over all hidden and input neurons, and multiply by 100. The relative importance of all output weights attributable to the given input variable is then obtained:

For i = 1 to ni:
    RI_i(%) = 100 × ( Σ_{h=1..nh} Q_ih ) / ( Σ_{h=1..nh} Σ_{i=1..ni} Q_ih )
End

2.4 Performance measurement
As previously undertaken in our operon classification study [5], the efficiency of operon classification was calculated as follows:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = (TP + TN) / (TP + FP + TN + FN)

where TP (true positives) are the operonic gene pairs correctly predicted among known operonic pairs; FN (false negatives) are the operonic gene pairs incorrectly predicted as non-operonic; TN (true negatives) are the correctly predicted non-operonic gene pairs among known non-operonic pairs; and FP (false positives) are the non-operonic gene pairs incorrectly predicted as operonic.

3. RESULTS
The accuracy obtained by this new operon classification method, including all individual STRING features, was slightly better (96.6% versus 95.%) than that obtained in our previous work [5]. The relative contribution of each feature was as follows: intergenic distance, 25.%; gene neighborhood, 32.6%; gene fusion, .5%; gene co-occurrence, 2.3%; gene co-expression, 7.7%; experimentally derived protein-protein interactions, .8%; information coming from other databases, 9.8%; and automatic literature mining, 2.2%. These results show the importance of each variable in discriminating the operon and non-operon classes. This is also shown in Figure 1, where it can be seen that gene fusion (Figure 1C) and experimental information (Figure 1F) are the features that show the smallest differences between the data of E. coli operonic and non-operonic pairs. This is essentially due to the amount of information for these variables in the STRING DB (Table 1), since some of them are better represented (have more records) than others.

Features                    Records number    Representation in STRING DB
STRING weighted scores      2,5,886
Gene neighborhood           2,865,            %
Gene fusion                 3,446             .%
Gene co-occurrence          6,59,             %
Gene co-expression          965,              %
Experimental inf.           473,              %
Inf. other DB               3,                %
Literature mining           2,223,96          8.5%

Table 1.
STRING features' representation in relation to the total size (records) of the DB.

Subsequently, the analysis of operon classification was repeated, this time using only the features that contribute most as input; gene fusion, gene co-occurrence and experimental information were not considered. A three-layer NN with a five-nine-one-neuron architecture was used. Interestingly, the accuracy obtained by this new NN (95.8% versus 96.6%) was only slightly worse than that obtained using all the STRING features, while the computational time and classifier complexity were reduced. Further work on this topic will validate this result by 10-fold cross-validation, to estimate how well the method generalizes, and apply the method to other genomes to evaluate its efficiency.
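The two computations described in the Methods — the weight-partition procedure of [7] for feature contributions and the sensitivity/specificity/precision measures — reduce to short routines. The sketch below is illustrative only: the random 8×11 weight matrix merely stands in for a trained network's input-to-hidden weights, and the confusion-matrix counts are made up.

```python
import random

def garson_importance(W):
    """Relative importance (%) of each input, from input-to-hidden weights.

    W[i][h] is the connection weight between input neuron i and hidden
    neuron h (in the paper, ni = 8 inputs and nh = 11 hidden neurons).
    """
    ni, nh = len(W), len(W[0])
    # Step (a): within each hidden neuron, each input's share of |weight|.
    col = [sum(abs(W[i][h]) for i in range(ni)) for h in range(nh)]
    Q = [[abs(W[i][h]) / col[h] for h in range(nh)] for i in range(ni)]
    # Step (b): sum over hidden neurons, normalise, multiply by 100.
    total = sum(sum(row) for row in Q)
    return [100.0 * sum(Q[i]) / total for i in range(ni)]

def performance(tp, fn, tn, fp):
    """Sensitivity, specificity and precision as defined in the Methods."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": (tp + tn) / (tp + fp + tn + fn),
    }

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(11)] for _ in range(8)]
ri = garson_importance(W)
assert abs(sum(ri) - 100.0) < 1e-9  # contributions sum to 100%
```

Note that because each column of Q sums to 1, the denominator equals nh, so the eight relative importances always add up to exactly 100%.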
ACKNOWLEDGMENT
This work was supported by CONACyT (grants 627-Q and SALUD-7-C-68992) and DGAPA (IN2278) to E.M.

Figure 1. Frequency distribution of intergenic distances and STRING features of E. coli operonic and non-operonic gene pairs (panels A-H: relative frequency (%) of intergenic distances (bp), gene neighborhood, gene fusion, gene co-occurrence, gene co-expression, experimental information, information from other DBs and literature mining).

REFERENCES
1. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97, 6652-6657.
2. Okuda,S., Kawashima,S., Kobayashi,K., Ogasawara,N., Kanehisa,M. and Goto,S. (2007) Characterization of relationships between transcriptional units and operon structures in Bacillus subtilis and Escherichia coli. BMC Genomics, 8, 48.
3. Romero,P.R. and Karp,P.D. (2004) Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics, 20, 709-717.
4. Dam,P., Olman,V., Harris,K., Su,Z. and Xu,Y. (2007) Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res., 35, 288-298.
5. Taboada,B., Verde,C. and Merino,E. (2010) High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res., published online.
6. Jensen,L.J., Kuhn,M., Stark,M., Chaffron,S., Creevey,C., Muller,J., Doerks,T., Julien,P., Roth,A., Simonovic,M. et al. (2009) STRING 8 - a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37, D412-D416.
7. Gevrey,M., Dimopoulos,I. and Lek,S. (2003) Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249-264.
+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions
More informationTMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg
title: short title: TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg lecture: Protein Prediction 1 (for Computational Biology) Protein structure TUM summer semester 09.06.2016 1 Last time 2 3 Yet another
More informationMutual Information & Genotype-Phenotype Association. Norman MacDonald January 31, 2011 CSCI 4181/6802
Mutual Information & Genotype-Phenotype Association Norman MacDonald January 31, 2011 CSCI 4181/6802 2 Overview What is information (specifically Shannon Information)? What are information entropy and
More informationProteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?
Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains
More informationCOMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization
: Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage
More informationIntegration of Omics Data to Investigate Common Intervals
2011 International Conference on Bioscience, Biochemistry and Bioinformatics IPCBEE vol.5 (2011) (2011) IACSIT Press, Singapore Integration of Omics Data to Investigate Common Intervals Sébastien Angibaud,
More informationA Novel Prediction Method of Protein Structural Classes Based on Protein Super-Secondary Structure
Journal of Computer and Communications, 2016, 4, 54-62 http://www.scirp.org/journal/jcc ISSN Online: 2327-5227 ISSN Print: 2327-5219 A Novel Prediction Method of Protein Structural Classes Based on Protein
More informationTaxonomy. Content. How to determine & classify a species. Phylogeny and evolution
Taxonomy Content Why Taxonomy? How to determine & classify a species Domains versus Kingdoms Phylogeny and evolution Why Taxonomy? Classification Arrangement in groups or taxa (taxon = group) Nomenclature
More informationBacillus anthracis. Last Lecture: 1. Introduction 2. History 3. Koch s Postulates. 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes
Last Lecture: Bacillus anthracis 1. Introduction 2. History 3. Koch s Postulates Today s Lecture: 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes 3. Phylogenetics I. Basic Cell structure: (Fig.
More informationConservation of Gene Co-Regulation between Two Prokaryotes: Bacillus subtilis and Escherichia coli
116 Genome Informatics 16(1): 116 124 (2005) Conservation of Gene Co-Regulation between Two Prokaryotes: Bacillus subtilis and Escherichia coli Shujiro Okuda 1 Shuichi Kawashima 2 okuda@kuicr.kyoto-u.ac.jp
More informationStatistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department
More informationBiology 105/Summer Bacterial Genetics 8/12/ Bacterial Genomes p Gene Transfer Mechanisms in Bacteria p.
READING: 14.2 Bacterial Genomes p. 481 14.3 Gene Transfer Mechanisms in Bacteria p. 486 Suggested Problems: 1, 7, 13, 14, 15, 20, 22 BACTERIAL GENETICS AND GENOMICS We still consider the E. coli genome
More informationPattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The
More informationA genomic-scale search for regulatory binding sites in the integration host factor regulon of Escherichia coli K12
The integration host factor regulon of E. coli K12 genome 783 A genomic-scale search for regulatory binding sites in the integration host factor regulon of Escherichia coli K12 M. Trindade dos Santos and
More informationBioinformatics 2. Yeast two hybrid. Proteomics. Proteomics
GENOME Bioinformatics 2 Proteomics protein-gene PROTEOME protein-protein METABOLISM Slide from http://www.nd.edu/~networks/ Citrate Cycle Bio-chemical reactions What is it? Proteomics Reveal protein Protein
More informationData Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td
Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak
More informationSUPPLEMENTARY MATERIALS
SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationDynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data Vadim Ayuyev, Joseph Jupin, Philip Harris and Zoran Obradovic Temple University, Philadelphia, USA 2009 Real Life Data is Often
More informationIntelligent Handwritten Digit Recognition using Artificial Neural Network
RESEARCH ARTICLE OPEN ACCESS Intelligent Handwritten Digit Recognition using Artificial Neural Networ Saeed AL-Mansoori Applications Development and Analysis Center (ADAC), Mohammed Bin Rashid Space Center
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationGenetic Variation: The genetic substrate for natural selection. Horizontal Gene Transfer. General Principles 10/2/17.
Genetic Variation: The genetic substrate for natural selection What about organisms that do not have sexual reproduction? Horizontal Gene Transfer Dr. Carol E. Lee, University of Wisconsin In prokaryotes:
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationLecture 16: Introduction to Neural Networks
Lecture 16: Introduction to Neural Networs Instructor: Aditya Bhasara Scribe: Philippe David CS 5966/6966: Theory of Machine Learning March 20 th, 2017 Abstract In this lecture, we consider Bacpropagation,
More informationCourse 395: Machine Learning - Lectures
Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture
More information#33 - Genomics 11/09/07
BCB 444/544 Required Reading (before lecture) Lecture 33 Mon Nov 5 - Lecture 31 Phylogenetics Parsimony and ML Chp 11 - pp 142 169 Genomics Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33
More informationIntroduction to Bioinformatics Integrated Science, 11/9/05
1 Introduction to Bioinformatics Integrated Science, 11/9/05 Morris Levy Biological Sciences Research: Evolutionary Ecology, Plant- Fungal Pathogen Interactions Coordinator: BIOL 495S/CS490B/STAT490B Introduction
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationProtein-Protein Interaction Classification Using Jordan Recurrent Neural Network
Protein-Protein Interaction Classification Using Jordan Recurrent Neural Network Dilpreet Kaur Department of Computer Science and Engineering PEC University of Technology Chandigarh, India dilpreet.kaur88@gmail.com
More information18.6 Regression and Classification with Linear Models
18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationECLT 5810 Classification Neural Networks. Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann
ECLT 5810 Classification Neural Networks Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann Neural Networks A neural network is a set of connected input/output
More information