PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS

T. Z. SEN, A. KLOCZKOWSKI, R. L. JERNIGAN
L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA

C. YAN, V. HONAVAR
Department of Computer Science, Iowa State University, Ames, IA

K. M. HO, C. Z. WANG, Y. IHM, H. CAO
Department of Physics and Astronomy, Iowa State University, Ames, IA

D. L. DOBBS, X. GU
Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA

We have combined four different methods for the prediction of binding sites in proteins: a data mining approach using Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. We combined these methods by developing consensus predictions. Our results show that the combination of all methods improves prediction accuracy over that of the individual methods.

1. Introduction

Protein-protein interactions play a critical role in protein function. The completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally, in order to decipher the networks of interacting proteins that act in coordination. Identifying protein-protein interaction sites, and detecting the specific residues that contribute to the specificity and strength of protein interactions, is an important problem [1-3] with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Better predictions of protein binding sites can be used to enhance gene annotations by identifying specific protein binding sites within genes, i.e. at a finer level of detail than whole protein identification.

Experimental detection of residues in protein-protein interaction surfaces must come either from determination of the structure of protein-protein complexes or from functional assays, such as yeast two-hybrid experiments. Computational efforts to identify interaction surfaces [4,5] have been limited to date, and need strengthening because experimental determination of protein structures in general, and of protein-protein complexes in particular, lags behind the number of known protein sequences. The relatively slow progress in high-throughput crystallography also increases the urgency for improved computational methods in the protein structure field. In particular, computational methods for identifying residues that participate in protein-protein interactions will assume an increasingly important role [6,7].

Based on the different characteristics of known protein-protein interaction sites, several methods have been proposed for predicting these sites using a combination of sequence and structural information. These include methods based on the presence of proline brackets [8], patch analysis using a 6-parameter scoring function [9], analysis of the hydrophobicity distribution around a target residue [10], multiple sequence alignments [11], structure-based multimeric threading, and analysis of the amino acid characteristics of spatial neighbors of a target residue using neural networks. Because structural information is not available for most protein complexes, it is critical to develop reliable computational methods that identify residues likely to participate in protein interactions from amino acid sequence alone, as is the intention here. By exploiting available databases of protein complexes, the data-driven discovery of sequence and structural correlates of protein-protein interactions offers a potentially powerful approach.
Our recent work has focused on the analysis of sequence neighbors of a target residue using SVM and Bayesian classifiers. None of the existing methods achieves prediction accuracies higher than about 70%. There is an acute need for multi-faceted approaches that utilize available databases of protein sequences, structures, protein complexes, and phylogenies, as well as other sources of information, for the data-driven discovery of sequence and structural correlates of protein-protein interactions.

2. Methods

2.1. Initial dataset and definition of surface residues and interface residues

The initial dataset was assembled from 115 unique proteins extracted from a set of 70 hetero-complexes in the PDB. Two definitions of interface residues are considered in this research. The first definition is based on the reduction of the accessible surface area (ASA) of individual proteins upon complex formation, calculated using the DSSP program. The relative ASA of a residue is its ASA divided by its nominal maximum area as defined by Rost and Sander [12]. A residue is defined as a surface residue if its relative ASA is at least 25% of its nominal maximum area, except for cysteine, for which the minimum relative ASA threshold is lower. By this definition, 55% of the residues in the set of 115 proteins were surface residues, corresponding to a total of 12,676 surface residues. A surface residue is defined as an interface residue if its calculated ASA in the complex is less than that in the monomer by at least 1 Å². Using these definitions, we obtained a dataset of 3,727 interface residues and 8,949 non-interface residues (i.e., surface residues that are not in the interaction sites). Thus, on average, interface residues represent 29% of surface residues, or 15% of total residues, for proteins in our dataset.

Sequence and Structure Conservation

Amino acid sequences are conserved for many different reasons related to the structure and function of proteins: stability, enzyme active sites, subunit interfaces, facilitation of an essential motion (hinges), and binding sites. One of the most difficult problems is to identify the reason for conservation of an individual highly conserved residue. This is one reason why a combination of approaches is more likely to permit identification of binding residues. Even identifying the conserved residues themselves is not completely straightforward. As will be seen below, different approaches indicate that individual residues are conserved to different extents. We intended to take advantage of this by using several methods for sequence and structure conservation.
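The surface and interface definitions given in the dataset section can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, the nominal maximum area is passed in by the caller rather than taken from a specific table, the surface test is assumed to use the monomer ASA, and the lower cysteine threshold (whose value the text does not give) is left as a parameter.

```python
# Thresholds quoted in the text; everything else here is an illustrative assumption.
SURFACE_REL_ASA = 0.25       # relative ASA needed to count as a surface residue
INTERFACE_DELTA_ASA = 1.0    # minimum ASA loss on complex formation, in square angstroms


def relative_asa(asa, max_asa):
    """Relative ASA: observed ASA divided by the nominal maximum area."""
    return asa / max_asa


def classify_residue(asa_monomer, asa_complex, max_asa,
                     surface_threshold=SURFACE_REL_ASA):
    """Label one residue as 'buried', 'interface' or 'non-interface'.

    Assumption: the surface test uses the ASA of the unbound monomer.
    Pass a smaller surface_threshold for cysteine (value not given in the text).
    """
    if relative_asa(asa_monomer, max_asa) < surface_threshold:
        return "buried"
    if asa_monomer - asa_complex >= INTERFACE_DELTA_ASA:
        return "interface"
    return "non-interface"


# Arithmetic check of the dataset counts reported in the text.
interface, non_interface = 3727, 8949
surface_total = interface + non_interface
interface_share = interface / surface_total   # fraction of surface residues in interfaces
```

With the reported counts, `surface_total` is 12,676 and `interface_share` rounds to 29%, matching the figures quoted above.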
Our two principal methods for this purpose were based on phylogeny, for sequence conservation, and on Conservatism of Conservatism, for structure conservation. The two methods often identify different residues as being conserved.

Using Phylogeny to Identify Conserved Residues

Many computational tools have been developed for identifying amino acid residues that are important for protein function or structure, but there is no consensus regarding the best measure of evolutionary conservation. Evolutionary conservation can be decomposed into three components: (1) the overall selective constraints, i.e. the number of changes observed at a site; (2) the pattern of amino acid substitutions, i.e. the number of amino acid types observed at a site; and (3) the effect of amino acid usage. We established a reliable relationship between each measure and various aspects of structure. To explore the connection between sequence conservation and functional-structural importance, we proposed a new measure that can decompose the conservation into these three components. This measure is based on phylogenetic analysis.

The evolutionary rate at site k during lineage l from amino acid i to amino acid j (i, j = 1, ..., 20) can be expressed as

$$\lambda_{kl}(i,j) = c_k \, a_{lk} \, Q(i,j \mid k),$$

where $c_k$ accounts for the rate variation among sites, $a_{lk}$ for the site-specific lineage (or subtree) effect caused by functional divergence, and the matrix $Q(i,j \mid k)$ is the (site-specific) model for amino acid substitutions. The likelihood function for a given tree can be determined according to a Markov chain model. We have developed an integrated computer program (DIVERGE) that can map these predicted sites onto the protein surface to examine these relationships. We used the solvent accessibility data from DSSP to restrict the predicted conserved residues to those located on the protein surface.

Conservatism of Conservatism

The phylogeny-based conservation of residues relies on sequence homology. It is well known, however, that many non-homologous proteins share similar folds. It is therefore highly desirable to study the conservation of residues in proteins based on the structural superimposition of non-homologous proteins. This view of evolutionary conservatism differs from the one used in developing phylogenetic trees but is complementary to it, and the two together lead to better predictions of important protein binding residues. To obtain insight into the evolutionary conservation of residues in proteins we use the Conservatism of Conservatism (CoC) method. The CoC method was developed by Mirny and Shakhnovich [13] for studying the evolutionary conservation of residues in proteins with specific folds from the FSSP database. Proteins belonging to different biological families may have similar folds. Mirny and Shakhnovich [13] analyzed conserved residues in several common folds.
The 20 residue types were subdivided into 6 classes, each of which groups amino acids with similar chemical characteristics and similar frequencies of occurrence at different positions in multiple sequence alignments. The evolutionary conservatism within families of homologous proteins was measured through sequence entropy. Structural superimposition of different families of proteins with similar folds was used to calculate the so-called CoC for all residue positions within a fold. We have applied this directly to our problem. We first calculated the sequence entropy at each position within a family of related sequences from the HSSP database,

$$s(l) = -\sum_{i=1}^{6} p_i(l) \log p_i(l),$$

where $p_i(l)$ is the frequency of residue class i (for each of the six classes) at position l in the multiple sequence alignment. We then used the FSSP database for the structural alignment. The structural superimposition of different families was used to calculate the conservatism-of-conservatism (CoC),

$$S(l) = \frac{1}{M} \sum_{m=1}^{M} s^{m}(l),$$

where $s^{m}(l)$ is the intrafamily conservatism within family m at position l and M is the number of families. The CoC is a measure of the evolutionary conservation of specific sites within the protein fold. The CoC method does not distinguish between residues at the protein surface that are evolutionarily conserved for functional reasons and residues inside the protein core that are conserved because of their importance to the folding process. We use the solvent accessibility data from DSSP for the unbound molecules to eliminate conserved residues located inside the protein core.

Data Mining Approaches to Binding Site Identification

Recent advances in machine learning and data mining offer a promising approach to the data-driven discovery of complex relationships in computational biology. In essence, the data mining approach uses a representative training set to extract complex relationships that are not known a priori, e.g., sequence correlates of protein-protein interactions. Examination of the resulting classifiers can help generate specific hypotheses that can be pursued using molecular and biophysical methods. For example, a classifier that is able to identify binding residues on the basis of sequence or structural features can provide insights into sequence characteristics that contribute to important differences in function. The data mining approach to binding site identification consists of the following steps:

- Identify the surface residues of each protein.
- Label each residue of each protein as an interface residue or a non-interface residue, based on appropriate criteria for defining residues in interaction sites.
- Use a machine learning algorithm to train a classifier that identifies a target residue in an amino acid sequence as an interface or non-interface residue. Different types of information about the target residue (e.g., the identity and physico-chemical properties of its sequence neighbors, or whether the target residue is a surface residue) can be supplied as input to the classifier. A variety of machine learning algorithms can be used for this purpose.
- Evaluate the classifier (typically using cross-validation) on independent test data not used to train the classifier.
- Apply the classifier to identify putative interaction residues in a protein given its sequence (and possibly its structure, but not the structure of its interaction partner).

Our studies have used the support vector machine (SVM) learning algorithm for this task because SVMs are well suited to the data-driven construction of high-dimensional patterns and are especially useful when the input is real-valued. In addition, algorithms for constructing SVM classifiers effectively incorporate methods to avoid over-fitting the training data, thereby improving the generality of the resulting classifiers, i.e., their performance on test data. Support vector machine algorithms have proven effective in several applications, including text classification, gene expression analysis using microarray data, and predicting whether or not a pair of proteins is likely to interact.

Threading of Sequences through Structures of Surfaces

In the methods described above, the properties of the protein-protein interface were deduced by concentrating on the sequence information contained in the protein pair under investigation. However, it is well accepted that the physical origin of the specificity of protein-protein interactions comes predominantly from their structures. Thus, any thorough investigation of protein-protein interactions must include information from structural studies. In this section, we emphasize predictions that are based on the structure of the protein-protein interface. It is known that the geometry of the interfaces can undergo substantial relaxation as a result of the binding process, and protein docking studies using rigid templates are not fully successful. We adapted methods employed in protein structure recognition to the problem of protein-protein interfaces. In the first stage, structural models for protein-protein interfaces were generated from existing Protein Data Bank (PDB) structures by extracting the portions in contact between different protein chains.
We found that defining the interaction region by the criterion that backbone Cα atoms on the two interacting chains are less than 15 Å apart gives reasonably connected fragments suitable for threading studies. In the second stage, having identified a set of candidate structures, threading studies are performed to examine the probability that a given model resembles the real interface. Through these studies, we identified the parts of the two proteins that are in the interfacial region and predicted the most probable residue-residue contacts between the two proteins.

Ensemble Predictions for Combining Results from Multiple Methods

Different approaches to identifying binding sites from amino acid sequence information yield different (sometimes contradictory, sometimes complementary) results. In such cases, approaches for combining results from multiple predictors are potentially important. The key idea is that results obtained using different approaches, which we shall call classifiers henceforth, may be correlated (or, more generally, statistically dependent) for a variety of reasons, including the use of common datasets for constructing or tuning classifiers, the use of intermediate variables for encoding inputs to the classifiers, and similarities between methods (e.g., SVM, neural networks). Regardless of the source of statistical dependency, the goal is to develop methods for weighting the output of each classifier appropriately so as to produce more accurate predictions. Our method takes as input the binary (True/False) output of each classifier (e.g., SVM, CoC) and produces as output a probability that the residue under consideration is a binding residue. Algorithms for learning Bayesian (or Markov) networks can then be used to learn the network of dependencies and the relevant conditional probabilities.

General Evaluation Measures for Assessing the Performance of Classifiers

Let TP denote the number of true positives (residues predicted to be interface residues that are actually interface residues), TN the number of true negatives (residues predicted not to be interface residues that are in fact not interface residues), FP the number of false positives (residues predicted to be interface residues that are not in fact interface residues), and FN the number of false negatives (residues predicted not to be interface residues that actually are interface residues). Let N = TP + TN + FP + FN. Sensitivity (recall) and specificity (precision) are defined for the positive (+) class as well as the negative (-) class:

Sensitivity+ = TP/(TP+FN)
Sensitivity- = TN/(TN+FP)
Specificity+ = TP/(TP+FP)
Specificity- = TN/(TN+FN)

Overall sensitivity, specificity, and false alarm rate correspond to the expected values of the corresponding measures over both classes. The performance of the classifier is summarized by the correlation coefficient,

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}},$$

which ranges from -1 to 1 and measures how well the predictions correlate with the actual data.

3. Results

For the results presented here we have used 7 proteins from the PDB, together with their sequence homologs. These proteins are 1acb_E (which abbreviates chain E of the protein 1acb), 1fle_E, 1hia_A, 1avw_A, 2sic_E, 3sgb_E, and 4cpa. Figure 1 shows the example of Proteinase B from Streptomyces griseus (3sgb). The protein chain 3sgb_E is shown in ribbons and the inhibitor 3sgb_I is shown at the top in wireframe. The residues of chain 3sgb_E that are actual or predicted binding sites are marked with different colors: red for true positives, yellow for false negatives, and dark blue for false positives.
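The evaluation measures defined in the Methods section can be sketched as a small Python function. This is an illustrative sketch rather than the authors' implementation: the function name and the returned dictionary keys are assumptions, and the convention of returning zero when a denominator vanishes is chosen to match the statement in the text that the correlation coefficient is zero when a method makes no positive predictions.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Per-class sensitivity and specificity plus the correlation coefficient.

    Follows the paper's usage, in which 'specificity' for a class is its
    precision. Ratios with a zero denominator are reported as 0.0 (assumption).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity_pos": ratio(tp, tp + fn),
        "sensitivity_neg": ratio(tn, tn + fp),
        "specificity_pos": ratio(tp, tp + fp),
        "specificity_neg": ratio(tn, tn + fn),
        "correlation": ratio(tp * tn - fp * fn, denominator),
    }


# Hypothetical counts, not taken from the paper's tables.
example = evaluate(tp=10, tn=80, fp=5, fn=5)
no_positives = evaluate(tp=0, tn=100, fp=0, fn=20)
```

The second call mimics a method that predicts no binding residues at all; its correlation coefficient comes out as zero under the stated convention.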

Figure 1. Three-dimensional structure of Proteinase B from Streptomyces griseus (3sgb). The target protein (proteinase 3sgb_E) is shown in ribbons, with residues color coded as: red for true positives, yellow for false negatives, and dark blue for false positives. The inhibitor partner is shown at the top in wireframe.

Phylogeny

We used ClustalX multiple sequence alignments of protein sequences to generate phylogenetic trees. We defined as conserved those residues that were identical at a given position in more than 85% of the aligned sequences (i.e. only 15% mutations or gaps were allowed). We found that a cutoff value of 85% gives the best results. Because the phylogenetic trees do not always show much evolutionary diversity, there are frequently many residues that satisfy this condition. We additionally filtered the results by removing conserved residues hidden inside the protein core, i.e., those with low solvent accessibility.

Table I shows the binding site predictions for Proteinase B from Streptomyces griseus (3sgb_E) for five different methods: 1 - Phylogeny, 2 - CoC, 3 - SVM, 4 - Threading and 5 - Consensus. The first line contains the position of residues along the chain. The second line shows the amino acid sequence. The actual binding site residues, determined from the reduction in surface area upon binding, are identified in the sequence line in reverse color. The next five lines show the binding sites predicted by the different methods. The predicted binding sites are marked by P (phylogeny; none are predicted for 3sgb_E, so the first line below the sequence line is always empty), C (CoC prediction), D (data mining prediction by SVM), T (prediction by threading) and E (consensus method).
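The 85% identity criterion used for the phylogeny-based predictions can be sketched as follows. This is an assumed reading of the criterion and not the authors' code: the function name is hypothetical, the alignment is taken to be a list of equal-length strings with `-` as the gap character, gaps count against conservation, and the comparison is strict ("more than 85%").

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.85, gap="-"):
    """Return the 0-based alignment columns whose most common residue
    appears in more than `threshold` of all sequences (gaps count against)."""
    n_sequences = len(alignment)
    conserved = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] != gap)
        if not counts:
            continue  # column made only of gaps
        _, top_count = counts.most_common(1)[0]
        if top_count / n_sequences > threshold:
            conserved.append(col)
    return conserved


# Toy alignment of 10 sequences (made up for illustration).
toy_alignment = ["AGS"] * 8 + ["AGT", "A-T"]
toy_conserved = conserved_columns(toy_alignment)
```

In the toy alignment, the first column is fully conserved, the second is conserved in 9 of 10 sequences, and the third in only 8 of 10, so only the first two pass the cutoff. A solvent accessibility filter, as described above, would then be applied to the surviving positions.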
The results from Table I are summarized in Table II, which shows the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the overall sensitivity, the overall selectivity, and the correlation coefficient for each of the methods. Because the numbers of TPs and FPs for 3sgb_E are zero, the correlation coefficient (which ranges from -1 to +1) for the phylogenetic tree prediction is zero. Table III shows overall average results for the correlation coefficient, sensitivity and selectivity for the seven protein complexes studied in this work. Two kinds of averages are considered: < >, the numerical average over the 7 proteins in the dataset, and < >n, the average over the number of residues.

Table I. Example results for Proteinase B from Streptomyces griseus (3sgb_E) for five different methods: 1 - Phylogeny, 2 - CoC, 3 - SVM, 4 - Threading and 5 - Consensus. The actual binding site residues are identified in the sequence line in reverse color. The predicted binding sites are marked by P (phylogeny; not present for 3sgb_E), C (CoC), D (data mining by SVM), T (threading) and E (consensus).

[Column alignment of the prediction markers was not preserved; the sequence rows and the surviving marker characters are listed in order.]

Sequence: ISGGXXXXXXXXXDAIYSSXXXXTGRCSLGFNVTYYFLTAGHCTDXGATTWWAXXXXXXX
Markers:  TTTT C D E T

Sequence: XXNSARTTVLGTTSGSXSFXXXXPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMA
Markers:  D E C C C CC C D T T E

Sequence: VTRRGSTTXXXXXXXXXXXXGTHSGSVTALNATVNYGGGDVVYGMIRTNXXXXXVCAGDS
Markers:  CC C CCCCCCC CC D D D D T TTTTT TT T E EEEEE E E

Sequence: GGPLYSGXXXXTRAIGLTSGGSGNCSSGGTTFFQPVTEALAYGVSVY
Markers:  CCCCCCCCCC C CCC D D D TTTTTTT TT EEEEEEEE E E

Table II. Summary results for Proteinase B from Streptomyces griseus (3sgb_E).

Columns: TP, TN, FP, FN, Sensitivity, Selectivity, Correlation
Rows: 1. Phylogeny; 2. CoC; 3. SVM; 4. Threading; 5. Consensus
[Numerical entries not preserved.]

Table III. Overall average results for selectivity, sensitivity and the correlation coefficient, averaged over the 7 proteins in the dataset. Averages over the number of residues are denoted by < >n.

Columns: <Selec>, <Selec>n, <Sens>, <Sens>n, <Cor>, <Cor>n
Rows: 1. Phylogeny; 2. CoC; 3. SVM; 4. Threading; 5. Consensus
[Numerical entries not preserved.]

Although the phylogenetic predictions are not the best ones, as seen from Table III, using the phylogenetic approach in combination with other methods can lead to significantly improved predictions.

Conservatism of Conservatism

To detect structurally conserved residues that are possible binding sites we used the Conservatism of Conservatism (CoC) method developed by Mirny and Shakhnovich [13]. We used the FSSP (Fold classification based on Structure-Structure alignment of Proteins, developed by Holm and Sander, 1993) alignments for the studied set of 7 protein complexes. For each family the HSSP alignments were used to calculate the sequence entropy at each position of the alignment, and the structural alignment (FSSP) was used to calculate the value of CoC. All residues in the protein chain were ranked according to the value of CoC at their position in the sequence. We defined as conserved the top 3/4 of the total number of residues ranked according to their value of CoC. We then filtered the results of the CoC ranking by removing all structurally conserved residues located inside the protein core. The results of the prediction for 3sgb_E are shown in Table I, where each C indicates a predicted binding residue. The overall performance of the CoC method is summarized in the second row of Tables II and III.
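The entropy and CoC calculations described in the Methods section, together with the ranking step above, can be sketched in Python. This is a minimal sketch under stated assumptions, not the published implementation: the six-class grouping below is an assumed grouping in the spirit of Mirny and Shakhnovich and should be checked against their paper, the families are assumed to be already structurally aligned so that one position index applies to all of them, and lower entropy is assumed to mean stronger conservation when ranking.

```python
import math
from collections import Counter

# Assumed six-class grouping (aliphatic, aromatic, polar, basic, acidic, special).
CLASSES = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLASS_OF = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def column_entropy(column):
    """s(l) = -sum_i p_i(l) log p_i(l) over the six residue classes."""
    classes = [CLASS_OF[aa] for aa in column if aa in CLASS_OF]
    if not classes:
        return 0.0
    total = len(classes)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(classes).values())


def coc(families, position):
    """S(l): mean of the intrafamily entropies s^m(l) over the M families."""
    entropies = [column_entropy([seq[position] for seq in family])
                 for family in families]
    return sum(entropies) / len(entropies)


def most_conserved(scores, fraction=0.75):
    """Indices of the `fraction` of positions with the lowest CoC score
    (assumption: low entropy means high conservation)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:int(len(scores) * fraction)])


# Two toy families with two positions each (made up for illustration).
toy_families = [["AK", "AR"], ["AD", "KE"]]
toy_scores = [coc(toy_families, pos) for pos in range(2)]
```

In the toy data, position 0 is conserved within the first family but mixes two classes in the second, so its CoC is half of log 2; position 1 stays within one class in both families and scores zero.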
The CoC method alone is clearly not sufficient to give a successful prediction of the binding sites; combining it with other prediction techniques in the consensus method gives improved results.

Data Mining for Binding Residues

We generated a classifier that determines whether or not a surface residue is located in the interaction site using the sequence neighbors of the target residue. An 11-residue window consisting of the residue and its 10 sequence neighbors (5 on each side) was chosen empirically. Each amino acid in the 11-residue window is represented using 20 values obtained from the HSSP profile of the sequence. The HSSP profile is based on a multiple alignment of the sequence and its potential structural homologues. Each target residue is therefore associated with a 220-element (11 x 20) vector. The SVM learning algorithm is given a set of labeled examples of the form (X, Y), where X is the 220-element vector representing a target residue and Y is its corresponding class label, either interface residue or non-interface residue. The learning algorithm generates a classifier that takes as input a 220-element vector encoding a target residue to be classified and outputs a class label.

Our previous study reported results for classifiers constructed using a combined set of 115 proteins belonging to six different categories of complexes: antibody-antigen, protease-inhibitor, enzyme complexes, large protease complexes, G-proteins, cell cycle signaling proteins, signal transduction, and miscellaneous. Our more recent study trained separate classifiers for each major category of complexes (e.g., protease-inhibitor complexes). In the case of protease-inhibitor complexes, 19 leave-one-out cross-validation experiments were performed. In each experiment, an SVM classifier was trained using a training set of surface residues, labeled as interface or non-interface residues, from 18 of the 19 proteins. The resulting classifier was used to classify the surface residues of the remaining target protein into interface residues and non-interface residues. The predicted binding sites for 3sgb_E are marked in Table I by D. The performance of the resulting classifiers for the current test set of complexes is summarized in row 3 of Tables II-III.
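The 11-residue window encoding and the leave-one-out protocol described above can be sketched in plain Python. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, the profile is assumed to be a list of 20-value rows (one per residue), and padding with zeros at the chain termini is an assumption, since the text does not say how windows near the ends are handled. Training the SVM itself is left out.

```python
def window_features(profile, index, half_window=5):
    """Concatenate the 20-value profile rows of the target residue and its
    5 neighbours on each side into one 220-element vector.

    Positions that fall outside the chain are zero-padded (assumption).
    """
    zero_row = [0.0] * 20
    features = []
    for pos in range(index - half_window, index + half_window + 1):
        row = profile[pos] if 0 <= pos < len(profile) else zero_row
        features.extend(row)
    return features


def leave_one_out(proteins):
    """Yield (training_proteins, held_out_protein) pairs, one per protein."""
    for i, held_out in enumerate(proteins):
        yield proteins[:i] + proteins[i + 1:], held_out


# Toy profile: residue i (0-based) carries the value i + 1 in all 20 columns.
toy_profile = [[float(i + 1)] * 20 for i in range(30)]
middle = window_features(toy_profile, 10)
terminus = window_features(toy_profile, 0)
folds = list(leave_one_out([f"protein_{k}" for k in range(19)]))
```

Each fold would train a classifier on the surface-residue vectors of the 18 training proteins and label the surface residues of the held-out protein, mirroring the 19 experiments described for protease-inhibitor complexes.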
We consistently find that classifiers trained on each category of complexes outperform classifiers trained on the combined dataset, which justifies our proposal to create separate databases for different categories of proteins.

Threading of Sequences through Structures of Surfaces

We performed threading studies for the set of 7 protease-inhibitor complexes. From each complex structure, we first extract the interfacial region as described earlier. The residue-residue contacts in the interfacial region are described with contact matrices. Threading was performed with the same algorithm that we used previously for single proteins in the CASP5 competition. Each sequence was threaded through the other structures in the database, and alignment results were used only when the score exceeded a threshold. From the alignments, the residues involved in contacts at the interface are identified, and a scale is generated that depends on the number of times a residue is indicated and on the strength of the threading score. The binding sites predicted for 3sgb_E by the threading method are marked in Table I by T. Row 4 of Tables II-III summarizes the predictions. The threading-based approach is somewhat more successful than the other methods in some categories, but the best performance comes from combining it with the other prediction methods in the consensus method (row 5).

Consensus Results and Discussion

Based on the results of the predictions with the four independent methods, we developed a simple consensus method to yield the best prediction. In the present consensus approach, a site is considered to be an interaction site if:

- at least three independent methods predict it, or
- any two methods predict it (except the Phylogeny-Threading pair), or
- only a single method predicts it and that method is SVM.

The results of the consensus are marked in Table I by E, and the last rows of Tables II-III summarize the performance of the consensus method. Consensus results are generally improved, and our data (not shown here because of space limitations) show that the improvements can be even more remarkable when the individual predictions of all four methods are relatively weak. This shows that combining diverse prediction methods is an excellent approach to predicting binding sites in protein complexes. Each of the four prediction methods presented in this paper sheds a different light on the conservation and prediction of binding sites, but none of the methods taken separately is as powerful as the combination of all methods. The simple consensus method presented here could be significantly improved by using ensemble predictions with more detailed probabilities. We will study this problem in more detail in the future.

Acknowledgments

V. Honavar, D.L. Dobbs and R.L. Jernigan acknowledge financial support through the NIH grant 1R21GM.

References

1. Chothia, C.; Janin, J. Nature 1975, 256,
2. Yan, C.; Dobbs, D.; Honavar, V. Springer-Verlag: Berlin, 2003; pp
3. Yan, C.; Dobbs, D.; Honavar, V. "Predicting protein-protein interaction sites from amino acid sequence," Iowa State University
4. Benner, S. A., et al. J. Mol. Biol. 1994, 235,
5. Bock, J. R.; Gough, D. A. Bioinformatics 2001, 17,
6. Teichmann, S. A.; Murzin, A. G.; Chothia, C. Curr. Opin. Struct. Biol. 2001, 11,
7. Valencia, A.; Pazos, F. Curr. Opin. Struct. Biol. 2002, 12,
8. Kini, R. M.; Evans, H. J. FEBS Lett. 1996, 385,
9. Jones, S.; Thornton, J. M. J. Mol. Biol. 1997, 272,
10. Gallet, X., et al. J. Mol. Biol. 2000, 302,
11. Pazos, F., et al. J. Mol. Biol. 1997, 271,
12. Rost, B.; Sander, C. Proteins 1994, 20,
13. Mirny, L. A.; Shakhnovich, E. I. J. Mol. Biol. 1999, 291,
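The consensus rule stated in the Consensus Results and Discussion section can be written as a short Python function. This is a sketch of the rule as worded in the text, not the authors' code: the method keys and function names are hypothetical, and each method's output is assumed to be a per-residue True/False flag.

```python
METHODS = ("phylogeny", "coc", "svm", "threading")


def consensus(predictions):
    """Apply the consensus rule to one residue.

    `predictions` maps method names to True/False; missing methods count
    as False (assumption).
    """
    positive = {m for m in METHODS if predictions.get(m, False)}
    if len(positive) >= 3:
        return True                       # at least three methods agree
    if len(positive) == 2:
        # any pair counts except Phylogeny-Threading
        return positive != {"phylogeny", "threading"}
    if len(positive) == 1:
        return positive == {"svm"}        # a lone prediction counts only for SVM
    return False


def consensus_chain(tracks):
    """Apply the rule residue by residue to equal-length per-method tracks."""
    length = len(next(iter(tracks.values())))
    return [consensus({m: tracks[m][i] for m in tracks}) for i in range(length)]


# Toy tracks for a four-residue chain (made up for illustration).
toy_tracks = {
    "phylogeny": [True, False, True, False],
    "coc":       [False, False, True, True],
    "svm":       [False, True, True, False],
    "threading": [True, False, False, False],
}
toy_consensus = consensus_chain(toy_tracks)
```

In the toy chain, the first residue is supported only by the excluded Phylogeny-Threading pair, the second only by SVM, the third by three methods, and the fourth only by CoC, so the consensus marks the second and third residues.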


More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

An optimized energy potential can predict SH2 domainpeptide

An optimized energy potential can predict SH2 domainpeptide An optimized energy potential can predict SH2 domainpeptide interactions Running Title Predicting SH2 interactions Authors Zeba Wunderlich 1, Leonid A. Mirny 2 Affiliations 1. Biophysics Program, Harvard

More information

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha Outline Goal is to predict secondary structure of a protein from its sequence Artificial Neural Network used for this

More information

Detecting unfolded regions in protein sequences. Anne Poupon Génomique Structurale de la Levure IBBMC Université Paris-Sud / CNRS France

Detecting unfolded regions in protein sequences. Anne Poupon Génomique Structurale de la Levure IBBMC Université Paris-Sud / CNRS France Detecting unfolded regions in protein sequences Anne Poupon Génomique Structurale de la Levure IBBMC Université Paris-Sud / CNRS France Large proteins and complexes: a domain approach Structural studies

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

A two-stage classifier for identification of protein protein interface residues

A two-stage classifier for identification of protein protein interface residues BIOINFORMATICS Vol. 20 Suppl. 1 2004, pages i371 i378 DOI: 10.1093/bioinformatics/bth920 A two-stage classifier for identification of protein protein interface residues Changhui Yan 1,2,5,, Drena Dobbs

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller Chemogenomic: Approaches to Rational Drug Design Jonas Skjødt Møller Chemogenomic Chemistry Biology Chemical biology Medical chemistry Chemical genetics Chemoinformatics Bioinformatics Chemoproteomics

More information

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction Jakob V. Hansen Department of Computer Science, University of Aarhus Ny Munkegade, Bldg. 540, DK-8000 Aarhus C,

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition Sequence identity Structural similarity Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Fold recognition Sommersemester 2009 Peter Güntert Structural similarity X Sequence identity Non-uniform

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data Data Mining and Knowledge Discovery, 11, 213 222, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. DOI: 10.1007/s10618-005-0001-y Accurate Prediction of Protein Disordered

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Receptor Based Drug Design (1)

Receptor Based Drug Design (1) Induced Fit Model For more than 100 years, the behaviour of enzymes had been explained by the "lock-and-key" mechanism developed by pioneering German chemist Emil Fischer. Fischer thought that the chemicals

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Jessica Wehner. Summer Fellow Bioengineering and Bioinformatics Summer Institute University of Pittsburgh 29 May 2008

Jessica Wehner. Summer Fellow Bioengineering and Bioinformatics Summer Institute University of Pittsburgh 29 May 2008 Journal Club Jessica Wehner Summer Fellow Bioengineering and Bioinformatics Summer Institute University of Pittsburgh 29 May 2008 Comparison of Probabilistic Combination Methods for Protein Secondary Structure

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

Supplementary Information

Supplementary Information Supplementary Information Performance measures A binary classifier, such as SVM, assigns to predicted binding sequences the positive class label (+1) and to sequences predicted as non-binding the negative

More information

Motif Prediction in Amino Acid Interaction Networks

Motif Prediction in Amino Acid Interaction Networks Motif Prediction in Amino Acid Interaction Networks Omar GACI and Stefan BALEV Abstract In this paper we represent a protein as a graph where the vertices are amino acids and the edges are interactions

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/15/07 CAP5510 1 EM Algorithm Goal: Find θ, Z that maximize Pr

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction CMPS 6630: Introduction to Computational Biology and Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the

More information

Conditional Graphical Models

Conditional Graphical Models PhD Thesis Proposal Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing

More information

Docking. GBCB 5874: Problem Solving in GBCB

Docking. GBCB 5874: Problem Solving in GBCB Docking Benzamidine Docking to Trypsin Relationship to Drug Design Ligand-based design QSAR Pharmacophore modeling Can be done without 3-D structure of protein Receptor/Structure-based design Molecular

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction Institute of Bioinformatics Johannes Kepler University, Linz, Austria Chapter 4 Protein Secondary

More information

Predicting the Binding Patterns of Hub Proteins: A Study Using Yeast Protein Interaction Networks

Predicting the Binding Patterns of Hub Proteins: A Study Using Yeast Protein Interaction Networks : A Study Using Yeast Protein Interaction Networks Carson M. Andorf 1, Vasant Honavar 1,2, Taner Z. Sen 2,3,4 * 1 Department of Computer Science, Iowa State University, Ames, Iowa, United States of America,

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy Burkhard Rost and Chris Sander By Kalyan C. Gopavarapu 1 Presentation Outline Major Terminology Problem Method

More information

Visualization of Macromolecular Structures

Visualization of Macromolecular Structures Visualization of Macromolecular Structures Present by: Qihang Li orig. author: O Donoghue, et al. Structural biology is rapidly accumulating a wealth of detailed information. Over 60,000 high-resolution

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007 Molecular Modeling Prediction of Protein 3D Structure from Sequence Vimalkumar Velayudhan Jain Institute of Vocational and Advanced Studies May 21, 2007 Vimalkumar Velayudhan Molecular Modeling 1/23 Outline

More information

Machine Learning Concepts in Chemoinformatics

Machine Learning Concepts in Chemoinformatics Machine Learning Concepts in Chemoinformatics Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-Universität Bonn BigChem Winter School 2017 25. October Data Mining in Chemoinformatics

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Prof. Dr. M. A. Mottalib, Md. Rahat Hossain Department of Computer Science and Information Technology

More information

#33 - Genomics 11/09/07

#33 - Genomics 11/09/07 BCB 444/544 Required Reading (before lecture) Lecture 33 Mon Nov 5 - Lecture 31 Phylogenetics Parsimony and ML Chp 11 - pp 142 169 Genomics Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33

More information

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Naoto Morikawa (nmorika@genocript.com) October 7, 2006. Abstract A protein is a sequence

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Applications of multi-class machine

Applications of multi-class machine Applications of multi-class machine learning models to drug design Marvin Waldman, Michael Lawless, Pankaj R. Daga, Robert D. Clark Simulations Plus, Inc. Lancaster CA, USA Overview Applications of multi-class

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

CS612 - Algorithms in Bioinformatics

CS612 - Algorithms in Bioinformatics Fall 2017 Databases and Protein Structure Representation October 2, 2017 Molecular Biology as Information Science > 12, 000 genomes sequenced, mostly bacterial (2013) > 5x10 6 unique sequences available

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction CMPS 3110: Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the laws of physics! Conformation space is finite

More information

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES HOW CAN BIOINFORMATICS BE USED AS A TOOL TO DETERMINE EVOLUTIONARY RELATIONSHPS AND TO BETTER UNDERSTAND PROTEIN HERITAGE?

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes

Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes Introduction The production of new drugs requires time for development and testing, and can result in large prohibitive costs

More information

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon A Comparison of Methods for Assessing the Structural Similarity of Proteins Dean C. Adams and Gavin J. P. Naylor? Dept. Zoology and Genetics, Iowa State University, Ames, IA 50011, U.S.A. 1 Introduction

More information

STRUCTURAL BIOINFORMATICS I. Fall 2015

STRUCTURAL BIOINFORMATICS I. Fall 2015 STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;

More information

In silico pharmacology for drug discovery

In silico pharmacology for drug discovery In silico pharmacology for drug discovery In silico drug design In silico methods can contribute to drug targets identification through application of bionformatics tools. Currently, the application of

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Computational Biology From The Perspective Of A Physical Scientist

Computational Biology From The Perspective Of A Physical Scientist Computational Biology From The Perspective Of A Physical Scientist Dr. Arthur Dong PP1@TUM 26 November 2013 Bioinformatics Education Curriculum Math, Physics, Computer Science (Statistics and Programming)

More information

Basics of protein structure

Basics of protein structure Today: 1. Projects a. Requirements: i. Critical review of one paper ii. At least one computational result b. Noon, Dec. 3 rd written report and oral presentation are due; submit via email to bphys101@fas.harvard.edu

More information

Automatic Epitope Recognition in Proteins Oriented to the System for Macromolecular Interaction Assessment MIAX

Automatic Epitope Recognition in Proteins Oriented to the System for Macromolecular Interaction Assessment MIAX Genome Informatics 12: 113 122 (2001) 113 Automatic Epitope Recognition in Proteins Oriented to the System for Macromolecular Interaction Assessment MIAX Atsushi Yoshimori Carlos A. Del Carpio yosimori@translell.eco.tut.ac.jp

More information

Structural Bioinformatics (C3210) Molecular Docking

Structural Bioinformatics (C3210) Molecular Docking Structural Bioinformatics (C3210) Molecular Docking Molecular Recognition, Molecular Docking Molecular recognition is the ability of biomolecules to recognize other biomolecules and selectively interact

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Motif Extraction and Protein Classification

Motif Extraction and Protein Classification Motif Extraction and Protein Classification Vered Kunik 1 Zach Solan 2 Shimon Edelman 3 Eytan Ruppin 1 David Horn 2 1 School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel {kunikver,ruppin}@tau.ac.il

More information

proteins Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick*

proteins Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick* proteins STRUCTURE O FUNCTION O BIOINFORMATICS Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick* Center for the Study

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Introduction to Computational Structural Biology

Introduction to Computational Structural Biology Introduction to Computational Structural Biology Part I 1. Introduction The disciplinary character of Computational Structural Biology The mathematical background required and the topics covered Bibliography

More information
