Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach


Prof. Dr. M. A. Mottalib, Md. Rahat Hossain
Department of Computer Science and Information Technology (CIT), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Dhaka, Bangladesh

Abstract

In recent years, representative sets of protein sequences from the protein data bank (e.g. RS126, CB513) have been used to predict the secondary structure of proteins. Secondary structure is a step toward describing the 3D structure of a protein, and the 3D structure determines the protein's function, so accurate prediction supports the discovery of new drugs. This paper proposes a new method for identifying representative protein sequences; an SVM (Support Vector Machine) trained on these representatives can then be used to predict protein secondary structure more accurately. The CATH database is used as the protein data bank. Its protein sequences are divided into 1110 super-families, and the proposed method identifies 1110 representatives from these super-families using a partitioning algorithm.

Keywords: Protein, Secondary Structure, Representative, CATH database, SVM (Support Vector Machine)

1 Introduction

Predicting the secondary structure of a protein is very important for discovering new drugs. The secondary structure contributes to the 3D structure of the protein, which in turn determines the protein's characteristics, and those characteristics guide drug discovery. It is therefore important to know the secondary structure of an unknown protein. At present there is a severe information asymmetry between researchers who sequence organism genomes and those who elucidate the 3D structure of biomolecules. On one hand, more and more genomes are being sequenced; on the other hand, protein structure information in the Protein Data Bank (PDB) (Berman et al., 2002) is accumulating very slowly.
Structure determination for the PDB relies heavily on experimental methods such as X-ray crystallography and NMR. These methods are accurate but expensive and time consuming: determining the structure of a single unknown protein sequence takes a long time, so it is impractical to apply them to every sequence. For this reason, protein structure prediction by homology modeling or computer simulation is emerging as an alternative or complementary approach, using a small set of domains drawn from the full set of protein sequences in a protein database.

The proteins in the CATH v3.0.0 database are divided into Classes and Architectures. Each Architecture contains topologies, and proteins that share similar secondary and 3D structure are grouped into a single topology. The whole protein world is divided into 1110 such topologies. These 1110 topologies represent the whole protein world, and it is easier to sample and work with them than with the individual proteins [1].

In this research a method is proposed for finding a list of proteins, one to represent each topology, called the representative of that topology. To find a representative, the first step is to identify the ideal characteristics of the topology, that is, the characteristics that represent all the proteins under it. Using those ideal characteristics, a model is built and matched against all the proteins to find the ones that hold the most ideal characteristics; these become candidates. The final representative is selected from the candidate list. Our algorithm carries out the whole process.

2 Existing Approaches

The Rost & Sander dataset [2], which has been used to train several early secondary structure prediction methods such as PHD [2] and PREDATOR [3], contains 126 proteins that share a pairwise sequence identity below 25%. However, as shown by Cuff and Barton [4], it contains some clear homologues when homology is measured by more than simple sequence identity.
Based on their studies, Cuff and Barton proposed a set of 513 protein chains in 1999 [4] (referred to below as the CB513 set). This set contains the 117 non-homologous chains of the Rost and Sander set (RS117) as well as 396 additional protein chains (CB396) carefully selected to exclude homologues as far as possible. The structural resolution of all proteins in the set is 2.5 Å or better. Over all amino acid residues in the set, the helix content is 34.5%, the sheet content 22.7% and the coil content 42.8%.

For our studies we use the CB513 protein set for initial training and testing of our prediction method and for optimizing several parameters. For this purpose the CB513 set is split into the CB396 set, used for training, and the RS117 set, used for testing. Since the proteins in the CB513 set do not provide a comprehensive sampling of the fold space known today, our final prediction method is trained on a larger set of proteins. This final set is based on the SCOP (Structural Classification of Proteins) database [5]. Following a method for compiling a protein dataset described by Cuff and Barton [4], we retrieved one representative protein chain for each of the 1290 SCOP superfamilies in release 1.65 (December 2003) from the ASTRAL database [6]. From this list we removed all superfamilies containing only members whose structure was resolved by NMR or with a resolution worse than 2.5 Å, as well as superfamilies belonging to the classes for transmembrane and multi-domain proteins (classes E and F). The result is a final dataset of 940 protein chains that form a representative subset of the fold space known today. The final database, named SCOP-SFR, contains 219 all-alpha, 202 all-beta, 190 α/β, 281 α+β and 48 small proteins, with a helix content of 36.79%, a sheet content of 22.78% and a coil content of 40.42%.

A fair comparison of our approach with other methods is difficult, since for most methods it is not published which proteins were used to train the method in its current state. Only a blind test therefore provides a fair setup for comparing our method against others. This need is met by the EVA server [7], which provides a set of recently published proteins together with blind test secondary structure predictions for those proteins made by important prediction servers such as PSI-PRED and PHD. Since SCOP version 1.65 was released in December 2003, all proteins added to the EVA server after that time (January to November 2004) are blind test examples for our prediction method, which is trained only on proteins available in the SCOP 1.65 release. This results in a test set of 105 proteins (named EVA105), with 9837 residues in total, for which predictions of PSI-PRED, PHD and PROF-SEC are available.

3 Method and Data Preparation

Any learning method needs an authentic reference dataset. In this paper all data are collected from the CATH v3.0.0 database, in which 86,151 domains are divided into 1110 topologies. The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank [1]. It has four major levels of hierarchy: Class, Architecture, Topology (fold family) and Homologous superfamily [8].
To generate the training dataset for the SVM classifier, protein sequences must be collected so that each topology is represented by a sequence. The standard method for predicting secondary structure has three major parts:

1. Generation of datasets, or representatives, of primary sequences from the databank.
2. The Sequence to Structure layer, including the detection of frequent amino acid patterns, the feature representation of those patterns, the selection of interesting features, and the use of those features to predict the secondary structure of a protein with unknown structure. The sequence to structure prediction method developed in this work employs Support Vector Machines in a similar way to Liu et al. [15] and Ward et al. [16].
3. The features and methods employed in the Structure to Structure layer.

3.1 Representative Protein Extraction

The first step of the analysis is to identify the protein sequences that hold, or represent, the characteristics of a given topology. For this, a model is developed and matched against all the proteins to find the ones that hold the most ideal characteristics; these are considered candidates. From the candidate list the final representatives are selected, one for each of the 1110 topologies into which the domains are divided.

Let N be the number of proteins in a topology. Each protein sequence of length L is divided into three equal or almost equal segments Ψ1, Ψ2, Ψ3 of lengths L1, L2 and L3, with L = L1 + L2 + L3. Let R = L mod 3; then L1 = L2 = L3 − R, so any remainder is added to the third segment.

Total primary protein sequences in CATH database:
Total topologies in CATH database: 1110

The target is therefore to find 1110 representatives from the primary protein sequences.

Total Frequency Matrix Generation (for each topology)

1. In each topology:

Topology: Helicase, Ruva Protein; domain 3
Total primary protein sequences in this topology: 370

i. Divide each primary sequence into three equal-length parts.

Primary protein sequence: PAPTPSSSPVPTLSPEQQEMLQAFSTQSGMNLEWSQKCLQDNNWDYTRSAQAFTHLKAKGEIPEVAFMK
Length: 69
Length of each part: 69/3 = 23

If the length is not divisible by 3, the remainder is added to the third part. For example, for a length of 14, 14/3 = 4 with remainder 2, giving Part 1: 4, Part 2: 4 and Part 3: 4 + 2 = 6.

The example sequence (length 69) is therefore split into:

Part 1 (23): PAPTPSSSPVPTLSPEQQEMLQA (Ψ1)
Part 2 (23): FSTQSGMNLEWSQKCLQDNNWDY (Ψ2)
Part 3 (23): TRSAQAFTHLKAKGEIPEVAFMK (Ψ3)

ii. For each part, count the number of appearances of each of the 20 amino acids.

Part 1: PAPTPSSSPVPTLSPEQQEMLQA

Table 1 Number of appearances of the 20 amino acids in Part 1

Amino acid  Count    Amino acid  Count
G           0        S           4
A           2        T           2
P           6        C           0
V           1        N           0
L           2        Q           3
I           0        K           0
M           1        H           0
F           0        R           0
Y           0        D           0
W           0        E           2
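The splitting and counting steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `split_into_three` and `count_amino_acids` are hypothetical, and the amino acid order simply follows Table 1.

```python
from collections import Counter

# Column order follows Table 1 of the paper.
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"


def split_into_three(sequence):
    """Split a sequence into three parts; any remainder goes to the third part."""
    base = len(sequence) // 3
    return sequence[:base], sequence[base:2 * base], sequence[2 * base:]


def count_amino_acids(segment):
    """Return the number of appearances of each of the 20 amino acids."""
    counts = Counter(segment)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}


# Example sequence from the Helicase, Ruva Protein; domain 3 topology.
sequence = "PAPTPSSSPVPTLSPEQQEMLQAFSTQSGMNLEWSQKCLQDNNWDYTRSAQAFTHLKAKGEIPEVAFMK"
psi1, psi2, psi3 = split_into_three(sequence)
part1_counts = count_amino_acids(psi1)
```

For the example sequence, `part1_counts` reproduces the values of Table 1 (for instance P: 6, S: 4, Q: 3), and a sequence of length 14 is split into parts of length 4, 4 and 6.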

Calculate this for the other two parts as well.

iii. Store the number of appearances of all 20 amino acids in a binary matrix. The binary characteristics matrix B for a segment (Ψ1, Ψ2 or Ψ3) of a particular sequence has one row for each number of appearances i and one column for each of the 20 amino acids j. B is defined as

$$B_{ij} = \begin{cases} 1 & \text{if the } j\text{th amino acid occurs } i \text{ times} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Thus there are three such matrices for each sequence. Combining the B matrices of each segment (Ψ1, Ψ2, Ψ3) separately over all sequences within a topology gives the Total Frequency matrices. There are three Total Frequency matrices T for the entire topology, with the same dimensions as B, defined as

$$T_{ij} = \sum_{k=1}^{N} B^{(k)}_{ij} \qquad (2)$$

where $B^{(k)}$ is the B matrix of the kth sequence in the topology. A weight matrix W, with the same dimensions as T, is then generated for each of the three T matrices:

$$W_{ij} = \frac{T_{ij}}{N} \times 100 \qquad (3)$$

Using equations (2) and (3), the Total Characteristics matrix TC is generated for each segment (Ψ1, Ψ2, Ψ3). The TC matrix holds all information about the characteristics of the topology. There are again three such matrices for the entire topology, with the same dimensions as T, defined as

$$TC_{ij} = \begin{cases} T_{ij} \, W_{ij} & \text{for } i = 0 \\ T_{ij} \, W_{ij} \, i & \text{otherwise} \end{cases} \qquad (4)$$

where i is the number of occurrences. From these three TC matrices, three Binary Ideal Characteristics matrices IC are generated for the topology. They represent the ideal, or representative, characteristics and are defined as

$$IC_{ij} = \begin{cases} 1 & \text{if } TC_{ij} \text{ holds the highest value of column } j \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Finally, the IC matrices are used to select the candidate proteins for the representative of the topology. Using a strict or loose matching technique, each of the three segments of every protein sequence is matched against the corresponding one of the three IC matrices.
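As a rough sketch of equations (1) to (5), the following Python code builds T, W, TC and IC for one segment position of a small toy topology. It is written under two assumptions that the text does not state: the matrices have one row per occurrence count from 0 up to the longest segment length, and ties in a column of TC are resolved in favour of the lowest row. The function names are hypothetical.

```python
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"  # column order as in Table 1


def binary_matrix(segment, n_rows):
    """Eq. (1): B[i][j] = 1 if amino acid j occurs exactly i times in the segment."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(n_rows)]
    for j, aa in enumerate(AMINO_ACIDS):
        matrix[segment.count(aa)][j] = 1
    return matrix


def ideal_characteristics(segments):
    """Build T, W, TC and IC for one segment position (e.g. all Psi1 segments)."""
    n = len(segments)
    n_rows = max(len(s) for s in segments) + 1  # assumed: counts 0..longest segment
    cols = range(len(AMINO_ACIDS))
    b_list = [binary_matrix(s, n_rows) for s in segments]
    # Eq. (2): sum the binary matrices over all N sequences.
    t = [[sum(b[i][j] for b in b_list) for j in cols] for i in range(n_rows)]
    # Eq. (3): percentage of sequences in which amino acid j occurs i times.
    w = [[t[i][j] / n * 100 for j in cols] for i in range(n_rows)]
    # Eq. (4): multiply by the occurrence count i, except in row 0.
    tc = [[t[i][j] * w[i][j] * (i if i > 0 else 1) for j in cols]
          for i in range(n_rows)]
    # Eq. (5): mark the row with the column maximum (lowest row wins ties: assumed).
    ic = [[0] * len(AMINO_ACIDS) for _ in range(n_rows)]
    for j in cols:
        best = max(range(n_rows), key=lambda i: tc[i][j])
        ic[best][j] = 1
    return t, w, tc, ic


# Toy topology with three very short "segments" (illustrative data only).
segments = ["AAG", "AGG", "AAC"]
t, w, tc, ic = ideal_characteristics(segments)
```

In this toy example two of the three segments contain A twice, so the ideal number of occurrences of A is 2, while amino acids that never appear have an ideal count of 0.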
The sequence that best matches the IC matrices is selected as a candidate representative of the topology. In the strict matching technique, the binary characteristics matrix B of each sequence is compared with the IC matrix to check whether they are identical, i.e. whether the mismatch error is 0. A mismatch counter is calculated, and the mismatch threshold is set to a maximum of six.

To calculate the mismatch error of a protein sequence with the strict matching technique, the B matrix of each segment is compared with the corresponding IC matrix (e.g. the B matrix of Ψ1 with the IC matrix for Ψ1, and so on):

(6)

The mismatch error of Ψ1 is then calculated by summing the values over all 20 amino acid columns j of the Ψ1 segment. The errors of Ψ2 and Ψ3 are calculated in the same way.

(7)

The total mismatch error across all three segments of a protein sequence is

(8)

If only one sequence attains the minimum error and that minimum is at most 6, it is taken as the representative of the topology. If more than one sequence shares the same minimum, a candidate list is generated and one sequence from that list is selected as the representative. If no sequence falls within the threshold, a loose matching approach is followed.

In the loose matching approach the threshold is increased to sixty (3 matrices x 20 amino acids = 60). To calculate the mismatch error of a protein sequence with the loose matching technique, the B matrix of each segment is again compared with the corresponding IC matrix (e.g. the B matrix of Ψ1 with the IC matrix for Ψ1, and so on):

(9)

The mismatch error of Ψ1 is then calculated by summing the values over all 20 amino acid columns j of the Ψ1 segment. The errors of Ψ2 and Ψ3 are calculated in the same way.

(10)

The total mismatch error across all three segments of a protein sequence is

(11)

If only one sequence attains the minimum error, it is taken as the representative of the topology. If more than one sequence shares the same minimum, a candidate list is generated.
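Because equations (6) to (11) are not reproduced in the text, the matching step can only be sketched under an assumption: here the mismatch error of a segment is taken to be the number of amino acid columns in which the B matrix and the IC matrix differ. This gives at most 20 per segment and 60 per sequence, which agrees with the loose threshold stated above. The strict and loose variants differ only in the threshold in this sketch, and the function names and the dictionary representation of the IC matrix (amino acid to ideal occurrence count) are hypothetical.

```python
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"


def segment_mismatch(segment, ideal_counts):
    """Assumed mismatch: number of amino acids whose count differs from the ideal.

    ideal_counts maps an amino acid to the row index marked 1 in its IC column;
    amino acids missing from the dict are taken to have an ideal count of 0.
    """
    return sum(1 for aa in AMINO_ACIDS
               if segment.count(aa) != ideal_counts.get(aa, 0))


def total_mismatch(segments, ideal_per_segment):
    """Sum the mismatch errors of the three segments of one protein."""
    return sum(segment_mismatch(s, ideal)
               for s, ideal in zip(segments, ideal_per_segment))


def select_candidates(proteins, ideal_per_segment, strict=6, loose=60):
    """Return the names of the proteins that share the minimum mismatch error.

    proteins maps a name to its (psi1, psi2, psi3) segments. The strict
    threshold is tried first and the loose one is the fallback.
    """
    errors = {name: total_mismatch(segs, ideal_per_segment)
              for name, segs in proteins.items()}
    best = min(errors.values())
    threshold = strict if best <= strict else loose
    return sorted(name for name, err in errors.items()
                  if err == best and err <= threshold)


# Toy data (illustrative only): ideal counts for Psi1, Psi2 and Psi3.
ideal = [{"A": 2, "G": 1}, {"C": 2}, {"G": 1}]
proteins = {"p1": ("AAG", "CC", "G"), "p2": ("AGG", "CC", "G")}
candidates = select_candidates(proteins, ideal)
```

If `select_candidates` returns a single name, that protein would be the representative; a longer list corresponds to the candidate list that is passed on for refinement.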
The candidate list is then refined. Refinement is an iterative process that repeats the whole procedure described so far (calculating the IC matrices of a topology, then strict or loose matching):

1. Set N to the number of proteins in the candidate list.
2. Repeat the whole process on the N proteins in the candidate list.
3. If the number of candidates decreases in an iteration, continue refining.
4. If the number of candidates does not decrease for 2 consecutive iterations, take the list as the final candidate list.

If only one sequence remains in the candidate list, it is taken as the representative of the topology; if more than one remains, one of them is selected as the representative.

3.2 Feature Extraction

According to the CATH database, the 1110 topologies fall into 4 divisions based on their structural types: Mainly Alpha (α), Mainly Beta (β), Alpha Beta (α-β) and Few Secondary Structures (fss). Because the representative sequences selected in the previous stage (Representative Protein Extraction) were collected from the CATH database, their structural information can also be obtained from the database itself. This helps in preparing the model for the SVM.

Taking four amino acids at a time from the 20 different amino acids, all possible combinations are constructed, giving a total of 160,000 different quadramers (20*20*20*20). All representatives of one structural type are collected in a Current list, while the representatives of the other 3 structural types are stored in an Other list. The frequency of each quadramer in both lists is then calculated. Let $fc_{i,j}$ be the frequency of the ith quadramer in the jth candidate sequence from the Current list and $fo_{i,j}$ the frequency of the ith quadramer in the jth candidate sequence from the Other list. A difference matrix $Diff_i$ is computed from the absolute difference between the numbers of occurrences of the 160,000 possible quadramers in the Current and Other lists:

$$Diff_i = \left| fc_{i,j} - fo_{i,j} \right|$$

where $Diff_i$ is the absolute difference of the occurrence of the ith quadramer. $Diff_i$ is sorted in descending order, and the first 4000 quadramers, those with the highest difference, are selected as the feature set for SVM training. To normalize the values of these features, the minimum and maximum frequency of each of the top 4000 quadramers in both the Current and Other lists are taken into account. A normalization parameter $norm_i$ is obtained according to the following equation.

(12)

3.3 SVM Training

SVM, a supervised machine-learning technique, has been used for computational biology problems because it efficiently handles the computationally expensive and noisy data that occur frequently in biology. It can also solve multi-class classification problems using the structural risk minimization principle. Given a training set in a vector space, an SVM finds the best decision hyperplane separating two classes. The quality of the decision hyperplane depends on the margin between the two hyperplanes defined by the SVM [13, 14]. For the SVM training of the proposed method, Libsvm (version 2.84) is used [9]. Libsvm is free, simple, easy-to-use and efficient software for SVM classification and regression.
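The quadramer feature selection of Section 3.2, and the step of handing the resulting features to Libsvm, can be sketched as follows. Several points are assumptions rather than details given in the text: quadramers are counted with overlapping windows, the frequencies are summed over all sequences of a list before the absolute difference is taken, and ties in the ranking are broken alphabetically. The normalization step is left out because its formula is not reproduced in the text. All function names are hypothetical; the output line uses Libsvm's sparse text format (`label index:value ...` with 1-based feature indices).

```python
from collections import Counter


def quadramer_counts(sequences, k=4):
    """Total count of every overlapping k-mer over a list of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def select_features(current, other, top_n=4000, k=4):
    """Rank quadramers by |count in Current - count in Other|, keep the top_n."""
    fc = quadramer_counts(current, k)
    fo = quadramer_counts(other, k)
    # Quadramers absent from both lists have a difference of 0, so only
    # observed quadramers need to be ranked.
    diff = {q: abs(fc[q] - fo[q]) for q in set(fc) | set(fo)}
    ranked = sorted(diff, key=lambda q: (-diff[q], q))
    return ranked[:top_n]


def feature_vector(sequence, features, k=4):
    """Raw (unnormalized) counts of the selected quadramers in one sequence."""
    counts = quadramer_counts([sequence], k)
    return [counts[q] for q in features]


def to_libsvm_line(label, vector):
    """Format one example in Libsvm's sparse format, omitting zero features."""
    pairs = [f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# Toy lists (illustrative only): "Current" structural type vs. all others.
current = ["AAAAA", "AAAAC"]
other = ["CCCCC"]
features = select_features(current, other, top_n=2)
line = to_libsvm_line(1, feature_vector("AAAACCCC", features))
```

Ranking only the observed quadramers avoids materializing all 160,000 combinations; with real data `top_n` would stay at 4000 as in the text.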
Libsvm is also a fast and memory-efficient implementation of an SVM. Training uses the 1110 representatives from the CATH database, selected through the representative selection procedure described above. These representatives are classified into four structural classes. Training and testing of the proposed method are done in two ways: one-against-one classification and one-against-others classification, or multi-class prediction [10].

Table 2 The training data set, built by taking the representative protein sequence from each T-level fold family of the 1110 topology families
Columns: Classes, Number of fold families, Range of chain length
Rows: α, β, α-β, fss

4 Result and Analysis

In one-against-one classification, a classifier is trained on data from two classes [11]. A binary classifier is constructed that maps examples of one class to +1 and examples of the other to −1. SVM parameters S=1 and T=2 are used for this test. The prediction accuracy of each one-against-one classification is shown in Table 3, for both the proposed method and the best known existing SVM method [12].

Table 3 Prediction accuracy (%) of one-against-one classification, proposed method vs. existing method [12]
Columns: Classifiers, Proposed Method, Existing Method [12]
Rows: α vs. β, α vs. α-β, α vs. fss, β vs. α-β, β vs. fss, α-β vs. fss, Average accuracy

Table 4 The binary classification accuracies of class folds (one-against-others), proposed method vs. existing method [12]
Columns: Classifier, Accuracies of dipeptide frequency (%) for the Proposed Method and the Existing Method [12]
Rows: α vs. other (including β, α-β and fss), β vs. other (including α, α-β and fss), α-β vs. other (including α, β and fss), fss vs. other (including α, β and α-β), Average accuracies

Four one-against-others classifications are used in this proposal, and the prediction accuracy is promising for most of them. The results are presented in Table 4. Two pairs of optimized parameters are used for this test: S=1, T=2 for α vs. other and α-β vs. other, and S=0, T=2 for β vs. other and fss vs. other. From these two pairs, the best accuracy for the multi-class prediction is taken. The classifier fss vs. others gives the highest prediction accuracy, about 92%, and β vs. others also gives good accuracy, in the range of 82% to 83%, for the parameters (S=0, T=2). The classifier α vs. others gives about 84% accuracy, and the classifier α-β vs. other gives around 88% accuracy, for the parameters (S=1, T=2). Table 4 shows good results for all four classes of protein.

In addition, 150 random protein sequences were collected and tested using the proposed model, which is generated during the SVM training process. Each class was tested against the proposed model of that class. Table 5 shows the test results for the 150 random sequences.

Table 5 Test result for 150 random sequences using the proposed model
Columns: Classifier, Correctly Identified (out of 150), Accuracy (%)
Rows: Mainly α, Mainly β, α-β, fss, Average accuracy

5 Future Development

So far, to extract the characteristics of each topology, we have divided each protein sequence into three equal parts. To improve the results, each protein sequence could instead be divided into three parts of variable length. A more complex decision model could be created for building the candidate list for each topology and for choosing the representative from that list. For the feature extraction of each structural division, we created a list of all possible amino acid sequences of length four and compared against it. Using a length of five may improve the quality of the results.

6 Conclusion

The results presented in this paper are the representatives of the topologies. The result analysis indicates that these 1110 topologies represent the whole protein world more accurately. The representatives can be used as a benchmark like RS126 and CB513, so that each structure prediction method can be trained on them and the results compared with one another. For the CATH database with an SVM approach, no published results were found on the prediction of random protein sequences, so the proposed prediction results cannot be compared with existing ones. Nevertheless, the results so far suggest that, compared with other prediction methods, the proposed method can predict the secondary structure of proteins at a satisfactory level (above the existing accuracy of about 60 to 70%). In addition, the computation time will be reduced.

References

[1] Web URL:
[2] Rost B. and Sander C.
(1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. J. Mol. Biol., 232.
[3] Frishman D. and Argos P. (1996) Incorporation of nonlocal interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering, 2.
[4] Cuff J. A. and Barton G. J. (1999) Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction. Proteins, 34.
[5] Murzin A. G., Brenner S. E., Hubbard T. J. P. and Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247.
[6] Brenner S. E., Koehl P. and Levitt M. (2000) The ASTRAL Compendium for Protein Structure and Sequence Analysis. Nucleic Acids Res., 28.
[7] Frishman D. and Argos P. (1995) Knowledge-based protein secondary structure assignment. Proteins, 23.
[8] Orengo C. A., Michie A. D., Jones S., Jones D. T., Swindells M. B. and Thornton J. M. (1997) CATH: A Hierarchic Classification of Protein Domain Structures. Structure, 5.
[9] Chang C.-C. and Lin C.-J. LIBSVM: a library for support vector machines. Software available at
[10] Nguyen M. N. and Rajapakse J. C. (2003) Prediction of Protein Secondary Structure with two-stage multi-class SVMs. Genome Informatics, 14.
[11] Kreßel U. (1999) Pairwise classification and support vector machines. In Advances in Kernel Methods: Support Vector Learning, Cambridge, MA: MIT Press.
[12] Sun X.-D. and Huang R.-B. (2006) Prediction of protein structural classes using support vector machines. Amino Acids, 30(4), June 2006.
[13] Rost B. and Sander C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. J. Mol. Biol., 232.
[14] Frishman D. and Argos P. (1996) Incorporation of nonlocal interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering, 2.
[15] Liu Y., Carbonell J., Klein-Seetharaman J. and Gopalakrishnan V. (2004) Context Sensitive Vocabulary And its Application in Protein Secondary Structure Prediction. SIGIR 04, Sheffield, South Yorkshire, UK.
[16] Ward J. J., McGuffin L. J., Buxton B. F. and Jones D. T. (2003) Secondary structure prediction with support vector machines. Bioinformatics, 13.


More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Hsuan-Tien Lin Learning Systems Group, California Institute of Technology Talk in NTU EE/CS Speech Lab, November 16, 2005 H.-T. Lin (Learning Systems Group) Introduction

More information

Heteropolymer. Mostly in regular secondary structure

Heteropolymer. Mostly in regular secondary structure Heteropolymer - + + - Mostly in regular secondary structure 1 2 3 4 C >N trace how you go around the helix C >N C2 >N6 C1 >N5 What s the pattern? Ci>Ni+? 5 6 move around not quite 120 "#$%&'!()*(+2!3/'!4#5'!1/,#64!#6!,6!

More information
