Identification of Essential Residues from Protein Secondary Structure Prediction and Applications

Size: px
Start display at page:

Download "Identification of Essential Residues from Protein Secondary Structure Prediction and Applications"

Transcription

1 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan 1 Introduction Proteins can perform functions when they fold into proper three-dimensional structures. However, since determining the structure of a protein through wet-lab experiments can be time-consuming and laborintensive, computational approaches are preferable. To characterize the structural topology of proteins, Linderstrøm-Lang proposed the concept of a protein structure hierarchy with four levels: primary, secondary, tertiary, and quaternary. In the hierarchy, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures, since it represents the local conformation of amino acids into regular structures. There are three basic secondary structure elements (SSEs): α-helices (H), β-strands (E), and coils (C). Many researchers employ PSS as a feature to predict the tertiary structure (Fischer et al., 2001, Gong & Rose, 2005, Meiler & Baker, 2003, Rost, 2001), function (Aydin et al., 2006, Eisner et al., 2005, Ferre & King, 2006, Laskowski et al., 2005), or subcellular localization (Nair & Rost, 2003, Nair & Rost, 2005, Su et al., 2007) of proteins. It is noteworthy that, among the various features used to predict protein function, such as amino acid composition, disorder patterns, and signal peptides, PSS makes the largest contribution (Lobley et al., 2007). Moreover it has been suggested that secondary structure alone may be sufficient for accurate prediction of a protein s tertiary structure (Przytycka et al., 1999). The majority of methods based on predicted secondary structure use the entire sequence for the prediction of this feature. However, individual residues in a given protein are not equally important; some are essential for folding, structural stability, and function of the protein, whereas others can be readily replaced (Capra & Singh, 2007). Moreover, many proteins contain unstructured regions which do not adopt well-ordered 3D structures (Schlessinger et al., 2007), making their secondary structure status unpredictable. The goals of this work include the identification of essential residues within a given protein sequence and their potential use for improvement of performance of other protein prediction problems. In particular, we explored whether it is more advantageous to use structural information using only the predicted essential residues or the entire sequence for prediction of PSS. In a previous work on PSS prediction (Lin et al., 2005), we proposed a method called PROSP, which utilizes a sequence-structure knowledge base to predict a query protein s secondary structure. The knowledge base consists of sequence fragments, each of which is associated with a corresponding structure profile. The structure profile is a position specific scoring matrix that indicates the frequency of each SSE at each position. The average Q 3 score of PROSP is approximately 75%.

2 78 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications In this study we have modified the knowledge base and introduced a new prediction method, call SymPred, to identify of the essential residues. Further we applied SymPred for the prediction of the PSS, and developed a protein function classification method, called ProtoPred. We used ProtoPred to have explored the role of essential residues in protein function predictions, enzyme/non-enzyme classification, and prediction of unstructured regions of proteins. The major differences between SymPred and PROSP are as follows. First, the constitutions of the knowledge bases are different. Second, the scoring systems of SymPred and PROSP are different. Third, unlike PROSP, SymPred allows inexact matching. Our experiment results show that SymPred can achieve 81.0% Q 3 accuracy on a non-redundant dataset, which represents a 5.9% performance improvement over PROSP. Current PSS prediction methods can be classified into two categories: template-based methods and sequence profile-based methods (Bondugula & Xu, 2007). Template-based methods use protein sequences of known secondary structures as templates, and predict PSS by finding alignments between a query sequence and sequences in the template pool. The nearest-neighbor method belongs to this category. It uses a database of proteins with known structures to predict the structure of a query protein by finding nearest neighbors in the database. By contrast, sequence profile-based methods (or machine learning methods) generate learning models to classify sequence profiles into different patterns. In this category, Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) are the most widely used machine learning algorithms (Ceroni et al., 2003, Cheng et al., 2005, Jones, 1999, Karplus et al., 1998, Kim & Park, 2003, Rost & Sander, 2000, Ward et al., 2003). Templatebased methods are highly accurate if there is a sequence similarity above a predefined threshold between the query and some of the templates; otherwise, sequence profile-based methods are more reliable. However, the latter may under-utilize the structural information in the training set when the query protein has some sequence similarity to a template in the training set (Bondugula & Xu, 2007). An approach that combines the strengths of both types of methods is required for generating reliable predictions despite the query sequence is similar or dissimilar to the templates in the training set. There are significant differences between SymPred and other methods in the two categories. First, in contrast to template-based methods, SymPred does not generate a sequence alignment between the query protein and the template proteins. Instead, it finds templates by using local sequence similarities and their possible variations. Second, SymPred is not a machine learning based approach. Moreover, it does not use a sequence profile, so it cannot be classified into the second category. However, like machine learningbased approaches, SymPred can capture local sequence similarities and generate reliable predictions as well as the essential residues of protein sequences. The experiment results on the two latest independent test sets (EVA_Set1 and EVA_Set2) show that, in terms of Q 3 accuracy, SymPred outperforms other existing methods by 1.4% to 5.4%. 2 Methods 2.1 Structure Conservation It is well known that a protein structure is encoded and determined by its amino acid sequence. However, the folding process is perhaps the most fundamental unresolved question in biology (Alexander et al., 2007). Fortunately, biologists have found that two proteins with a sequence identity above 40% may have a similar structure and function. The high degree of robustness of the structure with respect to the sequence variation shows that the structure is more conserved than the sequence.

3 H.N. Lin, T.Y. Sung & W.L. Hsu 79 In evolutionary biology, protein sequences that derive from a common ancestor can be traced on the basis of sequence similarity. Such sequences are referred to homologous proteins. In fact, the most successful approaches for predicting protein structures involve the detection of homologous proteins of known structures. These methods rely on the observation that the number of folds appears to be limited and homologous proteins adopt remarkably similar structures (Kelley & Sternberg, 2009). However, some homologous proteins are difficult to identify due to the low sequence identity and they are referred to remotely homologs. For example, the sequence identity between proteins 1aab and 1j46 is only 16.7% but they are structurally homologous and classified into the same family (HMG-box) in the SCOP classification. In such a case, it is difficult to discover the homologous relationship using sequence comparison methods. Profile-profile alignment methods (Pietrokovski, 1996, Rychlewski et al., 2000, Sadreyev & Grishin, 2003, Przybylski & Rost, 2007, Yona & Levitt, 2002) are capable of identifying remote homology; nevertheless, they are relatively slow. In this study, rather than using global sequence similarities we compile a knowledge base by using local sequence similarities to find useful proteins as templates for structure prediction. We demonstrate that local sequence similarity also possesses structure conservation and it is helpful to predict protein structure. 2.2 Local Sequence Similarity and Similar Peptides The local sequence similarity between a pair of sequences represents a sequence similarity that arises within a part of the two sequences, which can be identified by using alignment algorithms such as that of Smith and Waterman (Smith & Waterman, 1981). However, it can not guarantee a structural relationship in a local sequence similarity (Sternberg & Islam, 1990). To assess whether a local sequence similarity implies a homologous relationship, one should estimate the significance of the alignment. Statistically, when performing a PSI-BLAST search (Altschul et al., 1997) for a given query protein, an e-value of generally produces a safe searching and signifies sequence homology (Jones & Swindells, 2002). Therefore, we assume that two sequence fragments probably have a homologous relationship if they can be aligned by PSI-BLAST with an e-value of at least. Based on the assumption, we can further generate similar peptides from a significant alignment between two sequences. Given a query protein sequence p, PSI-BLAST would probably generate a large number of significant local pairwise alignments called high-scoring segment pairs (HSPs) between p and its similar proteins sp. Figure 1 shows an example of HSP with an e-value of In the alignment, the identical residues are labeled with letters and conserved substitutions are labeled with plus symbols. The sequence identity between the two sequence fragments in this example is 50% (=20/40). We use a sliding window of size n to scan each HSP and define the similar peptides for p. Given an n-gram pattern w in p, the similar peptide of w is defined as the other n-gram pattern sw in sp within the sliding window that is aligned with w and both of them are without any gaps. Take the sequence alignment in Figure 1 as an example. The Sbjct sequence is a similar protein to the Query sequence; therefore, DFDM is deemed similar to the peptide EWQL if the window size is 4, and FDMV is deemed similar to the next peptide WQLV. Based on the observation of high robustness of structures, if the Query is of known structure and the Sbjct is of unknown structure, we assume that each similar peptide sw adopts the same structure as its corresponding peptide w; i.e., sw inherits the structure of w. Moreover, different similar peptides sw for a given peptide w should have different similarity scores to w. To estimate the similarity between w and sw, we calculate the similarity level according to the number of amino acid pairs that are interchangeable. If two amino acids are aligned in a sequence alignment, they are said to be interchangeable if they have a positive score in BLOSUM62. Since in this

4 80 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications study a peptide is defined as an n-gram pattern, the range of the similarity level between the components of a peptide pair is from 0 to n. For example, the similarity level between DFDM and EWQL is 3, and that between FDMV and WQLV is also 3. In this study, we compile a knowledge base, called SPKB, which stands for Similar Peptide Knowledge Base. In this knowledge base, we collect a large amount of similar peptides for each protein sequence of known structure in a dataset. We transfer the structural information to the corresponding similar peptides and use the information as the knowledge of sequence to structure for PSS prediction. Figure 1: A local sequence alignment derived by BLAST. The identical residues are labeled with letters and conserved substitutions are labeled with plus symbols. The alignment in this example shows that the sequence fragment from position 7 to position 46 of the query sequence is very similar to that from position 3 to position 42 in the subject sequence. 2.3 Construction of the Knowledge Base SPKB Given a query sequence, we use PSI-BLAST to generate a number of significant alignments, from which we acquire possible sequence variations. In general, the similar protein sequences (i.e., the Sbjct sequences) reported by PSI-BLAST share highly similar sequence identities (between 25% and 100%) with the query, which implies that the sequences may have similar structures. Therefore, we identify similar peptides in those sequences. Using a dataset of protein sequences with known secondary structures, we construct a knowledge base, called SPKB. The dataset used to construct SPKB is described later in the Result section. For each protein p in the dataset, we first extract n-gram patterns from its original sequence using a sliding window of size n. Each n-gram pattern, as well as the corresponding SSEs of the successive n residues, the protein source p, and the similarity level (here, the similarity level is n), are stored as an entry in SPKB. A protein source p represents the structural information provider. We then use PSI-BLAST to generate a number of similar protein sequences. Specifically, to find similar sequences, we perform a PSI-BLAST search of the NCBInr database with parameters j=3, b=500, and e=0.001 for each protein p in the dataset. Since the NCBInr database only contains protein sequence information, each similar peptide inherits the SSEs of its corresponding word in p. A PSI-BLAST search for a specific query protein p generates a number of local pairwise sequence alignments between p and its similar proteins. Statistically, an e-value of generally produces a safe search and signifies sequence homology (Jones & Swindells, 2002). Similarly, each similar peptide and its inherited structure, the protein source p, and the similarity level are stored as an entry in SPKB.

5 H.N. Lin, T.Y. Sung & W.L. Hsu 81 Figure 2: The procedure used to extract n-grams and similar peptides for a query protein p. The procedure used to extract n-grams and their similar peptides for a given query protein p (assuming the window size n is 4). We use a sliding window to screen the query sequence and all the similar protein sequences found by PSI-BLAST and extract all n-gram peptides. Each n-gram is associated with a piece of structural information of the region from which it is extracted. The protein source of all the extracted n-gram is the query protein p, since all the structural information is derived from p. Figure 2 shows the procedure used to extract similar peptides for a query protein p. We use a sliding window to screen the query sequence, as well as all the similar protein sequences found by PSI-BLAST, and extract all n-grams. The query protein p is the protein source of all the extracted n-gram patterns. Each pattern is associated with a piece of structural information according to the region from which it is extracted. For example, WGPV is a similar peptide of WAKV. Since it is from a similar protein of unknown structure, it is associated with a piece of structural information of WAKV, which is HHHH. Note that a similar peptide may appear in more than one similar protein when all similar protein sequences are screened. We cluster identical n-gram patterns together and store the frequency in the similar peptide entry. Table 1 shows an example of a similar peptide entry in SPKB. In the example, WGPV is a similar peptide of proteins A, B and C, since it is extracted from the similar proteins of A, B and C. The similar peptide inherits the corresponding structural information of its source, and we can derive the corresponding similarity levels and frequencies via the extraction procedure. For example, the similarity level of WGPV in terms of protein source A is 3 and the frequency is 7. This implies that WGPV has 3 interchangeable amino acids with the corresponding protein word of A and it appears 7 times among the similar proteins of A found in the PSI-BLAST search result. Similar peptide: WGPV Protein Source Secondary Structure Similarity Level Frequency A HHHH 3 7 B HHCH 4 11 C CHHH 2 3 Table 1: An example of a similar peptide entry in SPKB (assuming the window size n is 4). WGPV is a similar peptide of proteins A, B and C, since it is extracted from the similar proteins of A, B and C. We transfer the structural information of protein sources to the corresponding similar peptides, and calculate the corresponding similarity levels and frequencies. For example, the similarity level of WGPV in terms of protein source A is 3 and the frequency is 7.

6 82 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications 2.4 SymPred: A PSS Predictor Based On SPKB Preprocessing Given a target protein t, whose secondary structure is unknown and to be predicted, we perform a PSI- BLAST search on t to compile a set of n-gram patterns containing its original n-grams and similar peptides. The procedure is similar to the construction of SPKB. We also calculate the frequency and similarity level of each n-gram in the peptide set The Scoring Function Each peptide w in the set is used to match against similar peptides in SPKB, and the structural information of each protein source in the matched entry is used to vote for the secondary structure of t. To differentiate the effectiveness of matched entries, we design a scoring function based on the protein sources in the matched entries and the sum of the weighted scores on the associated structures determines the predicted structure. Since we use the structural information of protein sources in the matched entries for structure prediction, we define the scoring function based on its similarity level and frequency recorded in the dictionary for the following observation. The similarity level represents the degree of similarity between a protein n-gram pattern and its similar peptide, and the frequency represents the degree of sequence conservation in the protein s evolution. Intuitively, the greater the similarity between two peptides, the closer they are in terms of evolution; likewise, the more frequently a peptide appears in a group of similar proteins, the more conserved it is in terms of evolution. To define the scoring function, we consider the similarity level and the frequency of the peptide in the peptide set of t, denoted by Sim t and freq t respectively, as well as those of a protein source i in its matched entry, denoted by Sim i and freq i respectively. Note that sim t and freq t are obtained in the preprocessing stage. To measure the effectiveness of the structural information of the protein source i, we define the voting score s i as min(freq t, freq i ) (1+min(Sim t, Sim i )). The structural information provided by i will be highly effective if: 1) w is very similar to the corresponding peptides of t and i; and 2) w is well conserved among the similar proteins of t and i. Take the similar peptide WGPV in Table 1 as an example. If WGPV is a similar peptide of t (assuming freq t is 5 and Sim t is 4), then the voting score of the structural information provided by protein source A is min(5, 7) (1+min(4, 3)) = 5 (1+3) = 20. Similarly, the voting score provided by protein source B is min(5, 11) (1+min(4, 4)) = 5 (1+4) = 25, and the score provided by protein source C is min(5, 3) (1+min(4, 2)) = 3 (1+2) = 9. The structural information provided by protein source B has highest score in this matched entry and therefore has the most effect on the prediction Structure Determination The final structure prediction of the target protein t is determined by summing the voting scores of all the protein sources in the matched entries. Specifically, for each amino acid in a protein t, we associate three variables, H(x), E(x), and C(x) which correspond to the total voting scores for the amino acid x to have structures H, E, and C, respectively. For example, assume that the above similar peptide WGPV is aligned with the residues of protein t starting at position 11. Then, protein A s contribution to the voting score of H(11), H(12), H(13), and H(14) will be 20. Similarly, protein B will contribute a voting score of 25 to H(11), H(12), C(13), and H(14); and protein C will contribute a voting score of 9 to C(11), H(12), H(13), and H(14). The structure at x is predicted to be H, E or C depending on max(h(x), E(x), C(x)). When two

7 H.N. Lin, T.Y. Sung & W.L. Hsu 83 or more variables have the same highest voting score, C has a higher priority than H, and H has a higher priority than E Confidence Level A confidence measure on a prediction for each residue is important to a PSS predictor, because it reflects the reliability on the output of the predictor. To evaluate the prediction confidence on each amino acid x, we calculate a confidence level to measure the reliability of the prediction. The confidence level on amino acid x is defined as follows. The range of ConLvl(x) is forced to be between 0 and 9 by rounding down. In the Results section, we will analyze the correlation coefficient between the confidence level and the average Q 3 score, and compare with that of PSIPRED. 2.5 SymPsiPred: A Secondary Structure Meta-predictor SymPred is different from sequence profile-based methods, such as PSIPRED, which is currently the most popular PSS prediction tool. PSIPRED achieved the top average Q 3 score of 80.6% in the 20 methods evaluated in the CASP4 competition. SymPred and PSIPRED use totally different features and methodologies to predict the secondary structure of a query protein. Specifically, SymPred relies on synonymous words, which represent local similarities among protein sequences and their homologies; however, PSIPRED relies on a position specific scoring matrix (PSSM) generated by PSI-BLAST, which is a condensed representation of a group of aligned sequences. Furthermore, SymPred constructs a proteindependent synonymous dictionary for inquiries about structural information. In contrast, PSIPRED builds a learning model based on a two-stage neural network to classify sequence profiles into a vector space; thus, it is a probabilistic model of structural types. To combine the results derived by the two methods, we compare the prediction confidence level of each residue from each method and return the structure with the higher confidence. Since SymPred and PSIPRED use different measures for the confidence levels, we transform their confidence levels into Q 3 scores. For each method, we generate an accuracy table showing the average Q 3 score for each confidence level, i.e., we use the average Q 3 score of an SSE to reflect the prediction confidence. For example, suppose SymPred predicts that a residue in a target sequence has structure H with a confidence level of 6, PSIPRED predicts that the residue has structure E with a confidence level of 6, and the corresponding Q 3 scores in the accuracy tables are 77.6% and 64.6% respectively. In this case, SymPsiPred would predict the residue as H. 2.6 Essential Residues ConLvl( x) 10 H( x) E( x) C( x) ( freqt freqi ) ( Simt max 1, 2 i, t 2 Simi ) Since the confidence level measures the ratio of voting scores a residue x gets to the summation of the normalization factors, it reflects the degree of sequence conservation in protein evolution. We use the confidence levels representing the degrees of importance of residues in determining the structure and function of a protein sequence.

8 84 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications To study the effectiveness of essential residues, we developed a general prediction method, called ProtoPred, which only uses the secondary structural information as the single feature for general proteome prediction problems, such as function prediction and enzyme/non-enzyme classification. The confidence levels are used as weights to indicate the degrees of importance of residues when finding protein templates for the prediction. 2.7 ProtoPred: A Prototype of Prediction Method Figure 3 shows the main algorithm of ProtoPred. ProtoPred is a simple template based method for general prediction problems. It is a standard query-template alignment algorithm that is used frequently in homology modeling or threading methods (Laskowski et al., 2005, Pandey et al., 2006, Frenkel- Morgenstern et al., 2005). (a) Template Extraction (b) ProtoPred s Prediction Procedure Figure 3: The main algorithm of ProtoPred. (a) Template extraction (b) the prediction procedure. For the training of ProtoPred, we used a sliding window of size w to extract the real secondary structure fragments from each of the training proteins. Each structure fragment carried the related information from its origin, such as function labels or protein classes. These fragments were treated as templates for predictions. For test phase we used the same sliding window to extract the predicted secondary structure fragments from the target protein. Each structure fragment (denoted as s) was used to search against the template pool. We compared the similarities between s and each template t in the template pool. The similarity was estimated as follows.

9 H.N. Lin, T.Y. Sung & W.L. Hsu 85 For each position x (from 1 to w) if s(x) was identical to t(x), then t would get a weighted score from s, i.e., the confidence level of s(x). Each s selects the best template t with the highest sum of weighted scores (denoted as Sum ws ). If the best template t was labeled as class A, then the target protein would get a score of Sum ws for class A. Finally, the target protein would be predicted as the class with the highest score. 3 Results 3.1 Datasets and Experiment Design We downloaded all the protein files in the DSSP database and compiled subsets of protein chains based on the thresholds of different sequence identities. Then, we generated two datasets, called DsspNr-25 and DsspNr-60 with the PSI-CD-HIT program (Li & Godzik, 2006). DsspNr-25 contains 8297 protein chains, whose mutual sequence identities are below 25%; DsspNr-60 contains protein chains, whose mutual sequence identities are below 60%. We use DsspNr-25 as the validation set to evaluate the number of residues correctly predicted Q number of all residues 3 100% performance of SymPred and SymPsiPred, and use DsspNr-60 as the template proteins to construct the SPKB. To predict a target protein t in the test set, we ignore the structural information of protein p in the template pool if p and t share at least 25% of the sequence identity, which we calculate with the ClustalW tool. To evaluate our methods, we use the Q 3 score to measure the performance of PSS prediction methods. In our approach, the Q 3 score is calculated as follows: We measure the Q 3 score for each protein in DsspNr-25 and take the average Q3 score as the final result to estimate the performance of SymPred and SymPsiPred. 3.2 Performance Evaluation of SymPred and SymPsiPred on DsspNr-25 Table 2 shows the prediction performance of SymPred, PSIPRED, PROSP, and SymPsiPred on the DsspNr-25 dataset. SymPred achieved Q 3 of 81.0% and SOV of 76.0%, outperforming PROSP by 5.9% in Q 3 and 7.3% in SOV. The meta-predictor, SymPsiPred which integrates the prediction power of SymPred and PSIPRED, achieved a further improvement on Q 3 of 83.9% on DsspNr-25. This result demonstrates that SymPsiPred can combines the strengths of the two methods and thus yield much more accurate predictions. DsspNr-25 Q 3 Q 3 Ho Q 3 Eo Q 3 Co sov sovh sove sovc SymPred PSIPRED SymPsiPred PROSP Table 2: The prediction performance of SymPred, PSIPRED, PROSP, and SymPsiPred. Q3Ho (Q3Eo and Q3Co, respectively) represents correctly predicted helix (strand and coil, respectively) residues (percentage of helix observed). sovh/e/c values are the specific SOV accuracies of the predicted helix, strand and coil, respectively.

10 86 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications 3.3 Evaluation of the confidence level Figure 4 shows the utility of our confidence level and PSIPRED s confidence level in judging the prediction accuracy of each residue in the test set. The statistics are based on more than 2 million residues. The correlation coefficient between the confidence levels and Q 3 scores for SymPred is 0.992, and that for PSIPRED is Thus, both methods provide strong confidence measures for the output. We observe that a confidence level of 7 or above reported by SymPred is attributed to 53% of the residues with more than 81% of the Q 3 accuracy which is comparable to the confidence level of 8 or above reported by PSIPRED. Furthermore, it can be observed that the prediction of SymPred is more reliable when the confidence levels of both methods are low. For example, the average Q 3 score of SymPred for the confidence level of 6 is 77.6%, whereas that of PSIPRED is 64.6%. 3.4 Performance Comparison with Existing Methods on EVA Benchmark Datasets EVA test sets usually serve as benchmarks of protein secondary structure predictors, particular for CASP competitions. Only proteins without significant sequence identity to previously known PDB proteins were used to test on different existing methods. We chose two latest EVA sequence-unique subsets of the PDB, called EVA_Set1 (protein list: and EVA_Set2 (protein list: the former containing 80 proteins tested on the most number of methods and the latter with the maximum number of proteins (212 proteins). The two datasets serve as independent test sets for performance comparison of SymPred with other existing methods. Figure 4: Comparison of the Q 3 scores and confidence levels derived from SymPred and PSIPRED. The correlation coefficient between the confidence levels and Q 3 scores for SymPred is 0.992, and that for PSIPRED is For fair comparison, when predicting the secondary structure of each target protein in an independent set, SymPred discarded the structural information of all proteins sharing at least 25% of the sequence identity with the target protein in the template pool, i.e., SymPred used in the template pool the structural information of proteins sharing no more than 25% sequence identity with the target protein.

11 H.N. Lin, T.Y. Sung & W.L. Hsu 87 Table 3 shows the experiment result on the two benchmark datasets, EVA_Set1 and EVA_Set2. It shows that SymPred achieves Q 3 accuracies of 78.8% (SOV=76.4%) and 79.2% (SOV=76.0%), outperforming existing state-of-the-art methods by 1.4% to 5.4%. It can be observed that SymPred performs better than each single predictor on most of performance measurements. EVA_Set1 Q 3 ERRsig Q 3 sov ERRsigsov sovh sove sovc SymPred 78.8 ± ± SAM-T99sec 77.2 ± ± PSIPRED 76.8 ± ± PROFsec 75.5 ± ± PHDpsi 73.4 ± ± EVA_Set2 Q 3 ERRsig Q 3 sov ERRsigsov sovh sove sovc SymPred 79.2 ± ± PSIPRED 77.8 ± ± PROFsec 76.7 ± ± PHDpsi 75.0 ± ± Table 3: The prediction performance of different methods on the EVA benchmark datasets. sovh/e/c values are the specific SOV accuracies of the predicted helix, strand and coil, respectively. The prediction results of other methods on EVA_Set1 and EVA_Set2 are reported at Results on Other Applications Using Essential Residues Although our prediction method and the meta-predictor can generate highly accurate predictions, we are more concerned about the efficacy of the essential residues in various applications that use PSS as a feature for predictions. We demonstrated the efficacy of essential residues by applying ProtoPred with essential residues as input features to protein function prediction and enzyme/non-enzyme classification Protein Function Prediction The knowledge of protein functions is crucial to the understanding of biological process. Since the experimental procedures for protein function annotation are inherently low throughput, the accurate computational techniques for protein function prediction represent useful tools. Automated protein function prediction methods include direct homology-based and indirect subsequence/feature-based approaches. For the indirect subsequence-based approaches, often only specific subsequences are crucial for the protein to perform its function (Pandey et al., 2006). This motivated us to use the essential residues in the function predictions. We downloaded the protein function labels from the Gene Ontology Annotation Database (goa_pdb) (Camon et al., 2004). Since we needed to compile a dataset whose protein sequences are not redundant (mutual sequence identity less than 25%) and each of them is of known secondary structure, we then made an intersection set of goa_pdb with DsspNr-25. The number of proteins is 2677 and the total number of distinct function labels is It is worth to note that the function labels contain all GO annotations for

12 88 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications the 2677 proteins, including the function labels of biological process, molecular functions, and cellular components. For example, the function labels of protein 1ak6 are 3779 (molecular function: actin binding) and 5622 (cellular component: intracellular). In this application, we focus on verifying the efficacy of different sources of PSS. These sources are the real secondary structures, the predicted secondary structures of SymPred, and the predicted secondary structure of PSIPRED. ProtoPred predicts the most specific function label among 1539 candidates for a target protein by using one of the sources of secondary structures rather than general functions. The prediction accuracy is 100% if the predicted function label belongs to the target protein, otherwise it is 0%. For example, if we predict 1ak6 as the function 3779 (or 5622) then the accuracy is 100%. The hierarchical structure of GO annotations is not exploited in our prediction method, though it could be used to improve prediction accuracy (Eisner et al., 2005). ProtoPred extract structure fragments using a sliding window of size w. Table 4 shows the results for several different window sizes. It can be observed that ProtoPred s prediction using the predicted secondary structure of SymPred shows the highest accuracy for all studied window sizes (except the window size of 11 because it is too short to represent the uniqueness of structures for different function classes). For example, for the window size of 51, the prediction accuracies of ProtoPred using the features of real structure, PSIPRED s prediction, and SymPred s prediction are 49.8%, 35.4%, and 57.6% respectively. Notably, the Q 3 of PSIPRED and SymPred on this dataset are 80.3% and 81.1%. Although the performances of PSS prediction of the two methods are similar, the effectiveness is quite different. Moreover, the performance of ProtoPred with SymPred s prediction is also better than that of ProtoPred with real structure. A possible explanation for this discrepancy is that different structures within a protein do not have equal importance for its function. It shows that SymPred can identify the essential residues which are crucial for proteins to perform their functions. Structural identities of low relevance residues dilute the influence of major residues when using the real structure as the feature in the ProtoPred s prediction. Window Size Real Structure PSIPRED SymPred Table 4: The accuracy (%) of function predictions using different structure sources and different window sizes Enzyme/non-enzyme classification Many protein function prediction methods focus on only one specific type of functions (Borgwardt et al., 2005, Dobson & Doig, 2003). The problem of enzyme and non-enzyme classifications is a special case of function prediction. We do not have to predict a functional type but only to distinguish between enzyme and non-enzyme. In Dobson and Doig s study, they use multiple features such as secondary structure, amino acid propensities, and surface properties to do the binary classifications. They further divide the features into 52 sub-features and select 36 optimal sub-features for the SVM models to generate the classifier. The overall accuracies are 77.16% and 80.14% for the two different sizes of sub-features, respectively. We download Dobson and Doig s dataset which contained 1076 proteins. Since SymPred s prediction is the most effective feature among different sources of PSS in the above protein function prediction,

13 H.N. Lin, T.Y. Sung & W.L. Hsu 89 ProtoPred uses SymPred s prediction as the input feature for the problem of enzyme and non-enzyme classifications. ProtoPred achieves an overall accuracy 81.8%. In this application, we only use the secondary structural information for enzyme/non-enzyme classification and achieve a better result. It suggests that the secondary structural information with the essential residue annotation may be sufficient to predict protein functions, which supports the conclusion of Przytycka et al (Przytycka et al., 1999). 4 Conclusions In this paper, we have proposed an improved knowledge based approach called SymPred for PSS prediction. We have also presented a meta-predictor called SymPsiPred, which combines a knowledge based approach (SymPred) and a machine learning based approach (PSIPRED). Tests on a proteome-scale dataset of 8297 protein chains show that the overall average Q 3 accuracy of SymPred and SymPsiPred is 81.0% and 83.9% respectively. SymPred can be regarded as a special case of a template-based approach because it predicts PSS by finding template sequences based on local similarities, i.e., n-gram words. Almost all other methods adopting the predicted secondary structure as a feature are based on the entire sequence; we demonstrate that it is advantageous to weight amino acids according to their importance in determining the structure and function of a protein sequence. We test two different problems to verify the efficacy of the predicted structures and their weighted scores. In the problem of protein function prediction and enzyme/non-enzyme classification, the performance of using the feature of SymPred s PSS prediction result shows that the structural information with essential residue annotation is more suitable for predicting specific protein functions. References Alexander, P. A., He, Y., Chen, Y., Orban, J., & Bryan, P. N. (2007). The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci U S A, 104(29), Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI- BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), Aydin, Z., Altunbasak, Y., & Borodovsky, M. (2006). Protein secondary structure prediction for a single-sequence using hidden semi-markov models. Bmc Bioinformatics, 7, -. Bondugula, R., & Xu, D. (2007). MUPRED: A tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction. Proteins-Structure Function and Bioinformatics, 66(3), Borgwardt, K. M., Ong, C. S., Schonauer, S., Vishwanathan, S. V. N., Smola, A. J., & Kriegel, H. P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21, I47-I56. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., et al. (2004). The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Research, 32, D262-D266. Capra, J. A., & Singh, M. (2007). Predicting functionally important residues from sequence conservation. Bioinformatics, 23(15), Ceroni, A., Frasconi, P., Passerini, A., & Vullo, A. (2003). A combination of support vector machines and bidirectional recurrent neural networks for protein secondary structure prediction. Ai(Asterisk)Ia 2003: Advances in Artificial Intelligence, Proceedings, 2829, Cheng, H. T., Sen, T. Z., Kloczkowski, A., Margaritis, D., & Jernigan, R. L. (2005). Prediction of protein secondary structure by mining structural fragment database. Polymer, 46(12),

14 90 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications Dobson, P. D., & Doig, A. J. (2003). Distinguishing Enzyme Structures from Non-enzymes Without Alignments. Journal of Molecular Biology, 330(4), Eisner, R., Poulin, B., Szafron, D., Lu, P., & Greiner, R. (2005). Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology. Paper presented at the Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '05. Proceedings of the 2005 IEEE Symposium on. Ferre, S., & King, R. D. (2006). Finding motifs in protein secondary structure for use in function prediction. Journal of Computational Biology, 13(3), Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A., Rost, B., et al. (2001). CAFASP2: The second critical assessment of fully automated structure prediction methods. Proteins-Structure Function and Genetics, Frenkel-Morgenstern, M., Voet, H., & Pietrokovski, S. (2005). Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure. Bioinformatics, 21(13), Gong, H. P., & Rose, G. D. (2005). Does secondary structure determine tertiary structure in proteins? Proteins-Structure Function and Bioinformatics, 61(2), Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2), Jones, D. T., & Swindells, M. B. (2002). Getting the most from PSI-BLAST. Trends in Biochemical Sciences, 27(3), Karplus, K., Barrett, C., & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10), Kelley, L. A., & Sternberg, M. J. E. (2009). Protein structure prediction on the Web: a case study using the Phyre server. Nature Protocols, 4(3), Kim, H., & Park, H. (2003). Protein secondary structure prediction based on an improved support vector machines approach. Protein Engineering, 16(8), Laskowski, R. A., Watson, J. D., & Thornton, J. M. (2005). ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Research, 33, W89-W93. Li, W. Z., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), Lin, H. N., Chang, J. M., Wu, K. P., Sung, T. Y., & Hsu, W. L. (2005). HYPROSP II - A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence. Bioinformatics, 21(15), Lobley, A., Swindells, M. B., Orengo, C. A., & Jones, D. T. (2007). Inferring function using patterns of native disorder in proteins. Plos Computational Biology, 3(8), Meiler, J., & Baker, D. (2003). Coupled prediction of protein secondary and tertiary structure. Proceedings of the National Academy of Sciences of the United States of America, 100(21), Nair, R., & Rost, B. (2003). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins-Structure Function and Genetics, 53(4), Nair, R., & Rost, B. (2005). Mimicking cellular sorting improves prediction of subcellular localization. Journal of Molecular Biology, 348(1), Pandey, G., Kumar, V., & Steinbach, M. (2006). Computational Approaches for Protein Function Prediction: Department of Computer Science and Engineering, University of Minnesota, Twin Cities. Pietrokovski, S. (1996). Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Research, 24(19), Przybylski, D., & Rost, B. (2007). Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments. Nucleic Acids Research, 35(7), Przytycka, T., Aurora, R., & Rose, G. D. (1999). A protein taxonomy based on secondary structure. Nature Structural Biology, 6(7),

15 H.N. Lin, T.Y. Sung & W.L. Hsu 91 Rost, B. (2001). Review: Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134(2-3), Rost, B., & Sander, C. (2000). Third generation prediction of secondary structure Protein Structure Prediction: Methods and Protocols (pp ): Humana Press. Rychlewski, L., Jaroszewski, L., Li, W. Z., & Godzik, A. (2000). Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science, 9(2), Sadreyev, R., & Grishin, N. (2003). COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. Journal of Molecular Biology, 326(1), Schlessinger, A., Punta, M., & Rost, B. (2007). Natively unstructured regions in proteins identified from contact predictions. Bioinformatics, 23(18), Smith, T. F., & Waterman, M. S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147(1), Sternberg, M. J. E., & Islam, S. A. (1990). Local protein sequence similarity does not imply a structural relationship. Protein Engineering Design and Selection, 4(2), Su, E., Chiu, H.-S., Lo, A., Hwang, J.-K., Sung, T.-Y., & Hsu, W.-L. (2007). Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics, 8(1), 330. Ward, J. J., McGuffin, L. J., Buxton, B. F., & Jones, D. T. (2003). Secondary structure prediction with support vector machines. Bioinformatics, 19(13), Yona, G., & Levitt, M. (2002). Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. Journal of Molecular Biology, 315(5),

16 92 Chapter 5 Identification of Essential Residues from Protein Secondary Structure Prediction and Applications

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Protein Structure Prediction using String Kernels. Technical Report

Protein Structure Prediction using String Kernels. Technical Report Protein Structure Prediction using String Kernels Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159

More information

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES Eser Aygün 1, Caner Kömürlü 2, Zafer Aydin 3 and Zehra Çataltepe 1 1 Computer Engineering Department and 2

More information

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS Aslı Filiz 1, Eser Aygün 2, Özlem Keskin 3 and Zehra Cataltepe 2 1 Informatics Institute and 2 Computer Engineering Department,

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information

An Artificial Neural Network Classifier for the Prediction of Protein Structural Classes

An Artificial Neural Network Classifier for the Prediction of Protein Structural Classes International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2017 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article An Artificial

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments

Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments Uros Midic 1 A. Keith Dunker 2 Zoran Obradovic 1* 1 Center for Information Science and Technology Temple

More information

Conditional Graphical Models

Conditional Graphical Models PhD Thesis Proposal Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH Ashutosh Kumar Singh 1, S S Sahu 2, Ankita Mishra 3 1,2,3 Birla Institute of Technology, Mesra, Ranchi Email: 1 ashutosh.4kumar.4singh@gmail.com,

More information

Bayesian Models and Algorithms for Protein Beta-Sheet Prediction

Bayesian Models and Algorithms for Protein Beta-Sheet Prediction 0 Bayesian Models and Algorithms for Protein Beta-Sheet Prediction Zafer Aydin, Student Member, IEEE, Yucel Altunbasak, Senior Member, IEEE, and Hakan Erdogan, Member, IEEE Abstract Prediction of the three-dimensional

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Naoto Morikawa (nmorika@genocript.com) October 7, 2006. Abstract A protein is a sequence

More information

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity 1 frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity HUZEFA RANGWALA and GEORGE KARYPIS Department of Computer Science and Engineering

More information

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha Outline Goal is to predict secondary structure of a protein from its sequence Artificial Neural Network used for this

More information

Protein 8-class Secondary Structure Prediction Using Conditional Neural Fields

Protein 8-class Secondary Structure Prediction Using Conditional Neural Fields 2010 IEEE International Conference on Bioinformatics and Biomedicine Protein 8-class Secondary Structure Prediction Using Conditional Neural Fields Zhiyong Wang, Feng Zhao, Jian Peng, Jinbo Xu* Toyota

More information

CS612 - Algorithms in Bioinformatics

CS612 - Algorithms in Bioinformatics Fall 2017 Databases and Protein Structure Representation October 2, 2017 Molecular Biology as Information Science > 12, 000 genomes sequenced, mostly bacterial (2013) > 5x10 6 unique sequences available

More information

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction Institute of Bioinformatics Johannes Kepler University, Linz, Austria Chapter 4 Protein Secondary

More information

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Prof. Dr. M. A. Mottalib, Md. Rahat Hossain Department of Computer Science and Information Technology

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Template-Based Modeling of Protein Structure

Template-Based Modeling of Protein Structure Template-Based Modeling of Protein Structure David Constant Biochemistry 218 December 11, 2011 Introduction. Much can be learned about the biology of a protein from its structure. Simply put, structure

More information

Efficient Remote Homology Detection with Secondary Structure

Efficient Remote Homology Detection with Secondary Structure Efficient Remote Homology Detection with Secondary Structure 2 Yuna Hou 1, Wynne Hsu 1, Mong Li Lee 1, and Christopher Bystroff 2 1 School of Computing,National University of Singapore,Singapore 117543

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction CMPS 6630: Introduction to Computational Biology and Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the

More information

K-means-based Feature Learning for Protein Sequence Classification

K-means-based Feature Learning for Protein Sequence Classification K-means-based Feature Learning for Protein Sequence Classification Paul Melman and Usman W. Roshan Department of Computer Science, NJIT Newark, NJ, 07102, USA pm462@njit.edu, usman.w.roshan@njit.edu Abstract

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like SCOP all-β class 4-helical cytokines T4 endonuclease V all-α class, 3 different folds Globin-like TIM-barrel fold α/β class Profilin-like fold α+β class http://scop.mrc-lmb.cam.ac.uk/scop CATH Class, Architecture,

More information

Optimization of the Sliding Window Size for Protein Structure Prediction

Optimization of the Sliding Window Size for Protein Structure Prediction Optimization of the Sliding Window Size for Protein Structure Prediction Ke Chen* 1, Lukasz Kurgan 1 and Jishou Ruan 2 1 University of Alberta, Department of Electrical and Computer Engineering, Edmonton,

More information

Protein structure alignments

Protein structure alignments Protein structure alignments Proteins that fold in the same way, i.e. have the same fold are often homologs. Structure evolves slower than sequence Sequence is less conserved than structure If BLAST gives

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction CMPS 3110: Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the laws of physics! Conformation space is finite

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data Data Mining and Knowledge Discovery, 11, 213 222, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. DOI: 10.1007/s10618-005-0001-y Accurate Prediction of Protein Disordered

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Protein tertiary structure prediction with new machine learning approaches

Protein tertiary structure prediction with new machine learning approaches Protein tertiary structure prediction with new machine learning approaches Rui Kuang Department of Computer Science Columbia University Supervisor: Jason Weston(NEC) and Christina Leslie(Columbia) NEC

More information

Supplementary Materials for mplr-loc Web-server

Supplementary Materials for mplr-loc Web-server Supplementary Materials for mplr-loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to mplr-loc Server Contents 1 Introduction to mplr-loc

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

The Homology Kernel: A Biologically Motivated Sequence Embedding into Euclidean Space

The Homology Kernel: A Biologically Motivated Sequence Embedding into Euclidean Space The Homology Kernel: A Biologically Motivated Sequence Embedding into Euclidean Space Eleazar Eskin Department of Computer Science Engineering University of California, San Diego eeskin@cs.ucsd.edu Sagi

More information

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models Last time Domains Hidden Markov Models Today Secondary structure Transmembrane proteins Structure prediction NAD-specific glutamate dehydrogenase Hard Easy >P24295 DHE2_CLOSY MSKYVDRVIAEVEKKYADEPEFVQTVEEVL

More information

Protein Structure and Function Prediction using Kernel Methods.

Protein Structure and Function Prediction using Kernel Methods. Protein Structure and Function Prediction using Kernel Methods. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Huzefa Rangwala IN PARTIAL FULFILLMENT OF THE

More information

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure Last time Today Domains Hidden Markov Models Structure prediction NAD-specific glutamate dehydrogenase Hard Easy >P24295 DHE2_CLOSY MSKYVDRVIAEVEKKYADEPEFVQTVEEVL SSLGPVVDAHPEYEEVALLERMVIPERVIE FRVPWEDDNGKVHVNTGYRVQFNGAIGPYK

More information

Prediction of protein secondary structure by mining structural fragment database

Prediction of protein secondary structure by mining structural fragment database Polymer 46 (2005) 4314 4321 www.elsevier.com/locate/polymer Prediction of protein secondary structure by mining structural fragment database Haitao Cheng a, Taner Z. Sen a, Andrzej Kloczkowski a, Dimitris

More information

Motif Prediction in Amino Acid Interaction Networks

Motif Prediction in Amino Acid Interaction Networks Motif Prediction in Amino Acid Interaction Networks Omar GACI and Stefan BALEV Abstract In this paper we represent a protein as a graph where the vertices are amino acids and the edges are interactions

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES

PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES 3251 PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES Chia-Yu Su 1,2, Allan Lo 1,3, Hua-Sheng Chiu 4, Ting-Yi Sung 4, Wen-Lian Hsu 4,* 1 Bioinformatics Program,

More information

Protein Secondary Structure Prediction using Feed-Forward Neural Network

Protein Secondary Structure Prediction using Feed-Forward Neural Network COPYRIGHT 2010 JCIT, ISSN 2078-5828 (PRINT), ISSN 2218-5224 (ONLINE), VOLUME 01, ISSUE 01, MANUSCRIPT CODE: 100713 Protein Secondary Structure Prediction using Feed-Forward Neural Network M. A. Mottalib,

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

The typical end scenario for those who try to predict protein

The typical end scenario for those who try to predict protein A method for evaluating the structural quality of protein models by using higher-order pairs scoring Gregory E. Sims and Sung-Hou Kim Berkeley Structural Genomics Center, Lawrence Berkeley National Laboratory,

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Received: 04 April 2006 Accepted: 25 July 2006

Received: 04 April 2006 Accepted: 25 July 2006 BMC Bioinformatics BioMed Central Methodology article Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps Tomas Ohlson 1, Varun Aggarwal

More information

Building 3D models of proteins

Building 3D models of proteins Building 3D models of proteins Why make a structural model for your protein? The structure can provide clues to the function through structural similarity with other proteins With a structure it is easier

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Analysis of N-terminal Acetylation data with Kernel-Based Clustering

Analysis of N-terminal Acetylation data with Kernel-Based Clustering Analysis of N-terminal Acetylation data with Kernel-Based Clustering Ying Liu Department of Computational Biology, School of Medicine University of Pittsburgh yil43@pitt.edu 1 Introduction N-terminal acetylation

More information

Bioinformatics: Secondary Structure Prediction

Bioinformatics: Secondary Structure Prediction Bioinformatics: Secondary Structure Prediction Prof. David Jones d.t.jones@ucl.ac.uk Possibly the greatest unsolved problem in molecular biology: The Protein Folding Problem MWMPPRPEEVARK LRRLGFVERMAKG

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Profiles and Majority Voting-Based Ensemble Method for Protein Secondary Structure Prediction

Profiles and Majority Voting-Based Ensemble Method for Protein Secondary Structure Prediction Evolutionary Bioinformatics Original Research Open Access Full open access to this and thousands of other papers at http://www.la-press.com. Profiles and Majority Voting-Based Ensemble Method for Protein

More information

Hybrid Fuzzy Algorithm for Protein Clustering Using Secondary Structure Parameters

Hybrid Fuzzy Algorithm for Protein Clustering Using Secondary Structure Parameters IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 3, Ver. III (May - June 2017), PP 01-07 www.iosrjournals.org Hybrid Fuzzy Algorithm for Protein Clustering

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein

More information

Research Article Extracting Physicochemical Features to Predict Protein Secondary Structure

Research Article Extracting Physicochemical Features to Predict Protein Secondary Structure The Scientific World Journal Volume 2013, Article ID 347106, 8 pages http://dx.doi.org/10.1155/2013/347106 Research Article Extracting Physicochemical Features to Predict Protein Secondary Structure Yin-Fu

More information

Protein Structure Prediction Using Neural Networks

Protein Structure Prediction Using Neural Networks Protein Structure Prediction Using Neural Networks Martha Mercaldi Kasia Wilamowska Literature Review December 16, 2003 The Protein Folding Problem Evolution of Neural Networks Neural networks originally

More information

SINGLE-SEQUENCE PROTEIN SECONDARY STRUCTURE PREDICTION BY NEAREST-NEIGHBOR CLASSIFICATION OF PROTEIN WORDS

SINGLE-SEQUENCE PROTEIN SECONDARY STRUCTURE PREDICTION BY NEAREST-NEIGHBOR CLASSIFICATION OF PROTEIN WORDS SINGLE-SEQUENCE PROTEIN SECONDARY STRUCTURE PREDICTION BY NEAREST-NEIGHBOR CLASSIFICATION OF PROTEIN WORDS Item type Authors Publisher Rights text; Electronic Thesis PORFIRIO, DAVID JONATHAN The University

More information

Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology

Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology Roman Eisner, Brett Poulin, Duane Szafron, Paul Lu, Russ Greiner Department of Computing Science University of

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis Jeremy Chang Identifying protein protein interactions with statistical coupling analysis Abstract: We used an algorithm known as statistical coupling analysis (SCA) 1 to create a set of features for building

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

INDEXING METHODS FOR PROTEIN TERTIARY AND PREDICTED STRUCTURES

INDEXING METHODS FOR PROTEIN TERTIARY AND PREDICTED STRUCTURES INDEXING METHODS FOR PROTEIN TERTIARY AND PREDICTED STRUCTURES By Feng Gao A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for

More information

Predicting protein contact map using evolutionary and physical constraints by integer programming (extended version)

Predicting protein contact map using evolutionary and physical constraints by integer programming (extended version) Predicting protein contact map using evolutionary and physical constraints by integer programming (extended version) Zhiyong Wang 1 and Jinbo Xu 1,* 1 Toyota Technological Institute at Chicago 6045 S Kenwood,

More information

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions? 1 Supporting Information: What Evidence is There for the Homology of Protein-Protein Interactions? Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane Supplementary text for the section

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics. Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu ----------------------------------------------------------------- Bond

More information

Supplementary Materials for R3P-Loc Web-server

Supplementary Materials for R3P-Loc Web-server Supplementary Materials for R3P-Loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to R3P-Loc Server Contents 1 Introduction to R3P-Loc

More information

Basics of protein structure

Basics of protein structure Today: 1. Projects a. Requirements: i. Critical review of one paper ii. At least one computational result b. Noon, Dec. 3 rd written report and oral presentation are due; submit via email to bphys101@fas.harvard.edu

More information

Improved Protein Secondary Structure Prediction

Improved Protein Secondary Structure Prediction Improved Protein Secondary Structure Prediction Secondary Structure Prediction! Given a protein sequence a 1 a 2 a N, secondary structure prediction aims at defining the state of each amino acid ai as

More information

Protein Structure: Data Bases and Classification Ingo Ruczinski

Protein Structure: Data Bases and Classification Ingo Ruczinski Protein Structure: Data Bases and Classification Ingo Ruczinski Department of Biostatistics, Johns Hopkins University Reference Bourne and Weissig Structural Bioinformatics Wiley, 2003 More References

More information

Protein Structure Prediction and Display

Protein Structure Prediction and Display Protein Structure Prediction and Display Goal Take primary structure (sequence) and, using rules derived from known structures, predict the secondary structure that is most likely to be adopted by each

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/15/07 CAP5510 1 EM Algorithm Goal: Find θ, Z that maximize Pr

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Bioinformatics: Secondary Structure Prediction

Bioinformatics: Secondary Structure Prediction Bioinformatics: Secondary Structure Prediction Prof. David Jones d.jones@cs.ucl.ac.uk LMLSTQNPALLKRNIIYWNNVALLWEAGSD The greatest unsolved problem in molecular biology:the Protein Folding Problem? Entries

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 2 Amino Acid Structures from Klug & Cummings

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS

PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS T. Z. SEN, A. KLOCZKOWSKI, R. L. JERNIGAN L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University Ames, IA

More information

proteins Refinement by shifting secondary structure elements improves sequence alignments

proteins Refinement by shifting secondary structure elements improves sequence alignments proteins STRUCTURE O FUNCTION O BIOINFORMATICS Refinement by shifting secondary structure elements improves sequence alignments Jing Tong, 1,2 Jimin Pei, 3 Zbyszek Otwinowski, 1,2 and Nick V. Grishin 1,2,3

More information

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 9. Protein Structure Prediction I Structure Prediction Overview Overview of problem variants Secondary structure prediction

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information