Evaluation of transmembrane helix predictions in 2014. Jonas Reeb, 1 Edda Kloppmann, 1,2 * Michael Bernhofer, 1 and Burkhard Rost 1,2,3,4


Proteins: Structure, Function, and Bioinformatics

Evaluation of transmembrane helix predictions in 2014

Jonas Reeb, 1 Edda Kloppmann, 1,2 * Michael Bernhofer, 1 and Burkhard Rost 1,2,3,4

1 Department of Informatics & Center for Bioinformatics & Computational Biology i12, Technische Universität München (TUM), Garching/Munich 85748, Germany
2 New York Consortium on Membrane Protein Structure (NYCOMPS), New York Structural Biology Center, New York, New York
3 Institute of Advanced Study (TUM-IAS), Garching/Munich 85748, Germany
4 Institute for Food and Plant Sciences WZW Weihenstephan, Freising, Germany

ABSTRACT

Experimental structure determination continues to be challenging for membrane proteins. Computational prediction methods are therefore needed and widely used to supplement experimental data. Here, we re-examined the state of the art in transmembrane helix prediction based on a nonredundant dataset with 190 high-resolution structures. Analyzing 12 widely used and well-known methods using a stringent performance measure, we largely confirmed the expected high level of performance. On the other hand, all methods performed worse for proteins that could not have been used for development. A few results stood out: First, all methods predicted proteins in eukaryotes better than those in bacteria. Second, methods worked less well for proteins with many transmembrane helices. Third, most methods correctly discriminated between soluble and transmembrane proteins. However, several older methods often mistook signal peptides for transmembrane helices. Some newer methods have overcome this shortcoming. In our hands, PolyPhobius and MEMSAT-SVM outperformed other methods.

Proteins 2015; 83: © 2014 Wiley Periodicals, Inc.

Key words: membrane protein; transmembrane helices; transmembrane helix; α-helical membrane protein; transmembrane helix prediction; evaluation.
INTRODUCTION

Transmembrane proteins

Protein complexes that are embedded into the membrane carry out essential processes such as transport, signaling, or adhesion. Transmembrane proteins (TMPs) are involved in many diseases, including cancer, diabetes, and cystic fibrosis. For example, G protein-coupled receptors (GPCRs) are essential for cell signaling 1 and constitute one of the largest families of TMPs in eukaryotes, with almost 800 genes in human. 2 To highlight their relevance: two recent Nobel Prizes were awarded for studies on GPCRs and 30% of today's drugs target GPCRs. 3,4 TMPs are assumed to pass the membrane either exclusively with transmembrane alpha-helices (TMHs) or with transmembrane beta-strands by forming beta-barrels. Beta-barrels have so far been found in gram-negative and acid-fast bacteria, as well as in chloroplasts and mitochondria. 5 Here, we focus on alpha-helical TMPs, the most abundant class of TMPs, assumed to constitute 20–30% of the proteome of any organism. 6–9

Additional Supporting Information may be found in the online version of this article.

Abbreviations: DBN, dynamic Bayesian network; GPCR, G protein-coupled receptor; HMM, hidden Markov model; OPM, Orientations of Proteins in Membranes; PDB, Protein Data Bank; PDBTM, Protein Data Bank of Transmembrane Proteins; NN, (artificial) neural network; SVM, support vector machine; TMH, transmembrane alpha-helix; TMP, (alpha-helical) transmembrane protein.

Grant sponsor: Alexander von Humboldt foundation through the German Federal Ministry for Education and Research (BMBF); Grant sponsor: New York Consortium on Membrane Protein Structure (NYCOMPS) from the Protein Structure Initiative (PSI) of the National Institutes of Health (NIH); Grant number: U54 GM

*Correspondence to: Edda Kloppmann, Department of Informatics & Center for Bioinformatics & Computational Biology i12, Technische Universität München (TUM), Boltzmannstr.
3, Garching/Munich, Germany. E-mail: edda.kloppmann@mytum.de

Received 5 September 2014; Revised 2 December 2014; Accepted 13 December 2014. Published online 26 December 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: /prot

© 2014 Wiley Periodicals, Inc.

Table I. Transmembrane Helix Prediction Methods

Name         Year   Method   Evolutionary information   Signal peptides   Topology
TopPred2            -        No                         No                Yes
PHDhtm       1995   NN       Yes                        No                Yes
HMMTOP2             HMM      No                         No                Yes
TMHMM2              HMM      No                         No                Yes
SOSUI        2002   -        No                         No                No
Phobius      2004   HMM      No                         Yes               Yes
PolyPhobius  2005   HMM      Yes                        Yes               Yes
MEMSAT3             NN       Yes                        Yes               Yes
Philius      2008   DBN      No                         Yes               Yes
SCAMPI       2008   HMM      No                         No                Yes
SPOCTOPUS    2008   NN+HMM   Yes                        Yes               Yes
MEMSAT-SVM   2009   SVM      Yes                        Yes               Yes

Listed are all TMH prediction methods evaluated, in chronological order. We chose 12 methods based on popularity and availability for an overview of developments in TMH prediction during the last 20 years. Given are the name of the method, the year of publication, and the machine learning approach used for the prediction (NN: neural network, HMM: hidden Markov model, SVM: support vector machine, DBN: dynamic Bayesian network). We indicate whether evolutionary information from multiple sequence alignments is used, as well as whether signal peptides and topology (inside/outside location of non-TMH regions with respect to the membrane) are predicted.

Experimental determination of high-resolution structures for TMPs continues to be challenging. 10,11 Therefore, these proteins are substantially underrepresented in the Protein Data Bank (PDB 12 ): fewer than 2% of the PDB structures are TMPs. 11,13 Even when the experimental structures are known, the location of the membrane segments remains to be estimated by algorithms such as the ones used by OPM (Orientation of Proteins in the Membrane 14 ) and PDBTM (Protein Data Bank of Transmembrane Proteins 13 ) or by curated annotations such as in MPtopo (Membrane Protein Topology Database 15 ). The gap between what we want to know about TMPs (given their biomedical importance) and what we do know in terms of experimental structures makes prediction methods particularly important.
Although the three-dimensional (3D) structure can be predicted de novo 16,17 or through comparative modeling, 18 for the majority of TMPs we still have to be content with the prediction of secondary structure and topology. 19 Reliable methods for the prediction of secondary structure in nonmembrane proteins also help in the annotation and understanding of TMPs. However, these predictors do not distinguish between alpha-helices within or outside the membrane. For TMPs, knowing the precise location of transmembrane regions is of high importance to understand protein function and to design experiments.

Several independent evaluations revealed TMH predictions to be rather successful. A recent analysis focused on evaluating correctly predicted topology. 28 We left this perspective out, both to reduce publication redundancy and because obtaining reliable, nonambiguous topology data for our dataset is not trivial. Here, we complemented previous assessments by tapping into today's wealth of high-resolution data. We assembled distinct subsets to highlight different aspects of performance: (i) new vs. old compares performance for TMPs unknown and known during development, (ii) eukaryotic vs. bacterial allows spotting origin-specific aspects, and (iii) sets of proteins with different numbers of TMHs. We considered TMH annotations from two different databases (OPM and PDBTM) to prevent algorithmic bias. In addition, we applied a more stringent criterion for considering a TMH to be correctly predicted. Finally, we also used a set of soluble proteins and proteins (TMPs as well as soluble) with signal peptides.

Over 30 advanced methods predict TMHs; some have been used for over two decades. 31 We focused on the following 12 (Table I): HMMTOP2, 32 MEMSAT3, 33 MEMSAT-SVM, 34 PHDhtm, 35 Philius, 36 Phobius, 37 PolyPhobius, 38 SCAMPI, 39 SOSUI, 40 SPOCTOPUS, 41 TMHMM2, 42 and TopPred2.
43

MATERIALS AND METHODS

TMPs from the PDB

A major challenge for the development and evaluation of transmembrane helix (TMH) prediction methods is the extraction of comprehensive and accurate datasets. In the past, such datasets relied on TMH annotations from biochemical experiments. In 2000, Möller et al. assembled a widely used dataset 44 that contained carefully extracted information from biochemical experiments, which was later referred to as low-resolution information in contrast to high-resolution information from, for example, X-ray crystallography. 24 Surprisingly, such low-resolution biochemical data turned out to not be significantly more accurate than prediction methods. 24,27,34 Therefore, more recent evaluations focus on high-resolution structures. 24,34,41 We used 462 TMPs (note: here exclusively denoting proteins with TMHs) common to three databases

(retrieved in September 2013): PDB, 12 OPM, 14 and PDBTM. 13 These 462 proteins have 1101 PDB chains. We mapped the TMH annotation based on the PDB ATOM record to the UniProt sequence using SIFTS. 45,46 The 1101 PDB chains do not contain chimeras or structure models and have a unique residue-level mapping between UniProt and ATOM record. For every sequence, all residues in an annotated TMH can be completely mapped. The resolution of the respective structures is at least 9 Å (average 2.9 Å). These 1101 sequences were redundancy reduced at HVAL>0 using UniqueProt. 47 This implied that no pair of proteins had >20% pairwise sequence identity over alignment lengths of >250 residues.

Table II. Discrimination Between TMPs and Soluble Proteins As Well As TMHs and Signal Peptides

Columns, no SPs: Sol Euk (5106), FPR; Sol Gram− (356), FPR; Sol Gram+ (911), FPR.
Columns, SPs: Sol Euk (1297), FPR; Sol Gram− (400), FPR; Sol Gram+ (204), FPR; TMP Euk (332), Sens.
Rows: TopPred2 a, PHDhtm a, HMMTOP2 a, TMHMM2 a, SOSUI a, Phobius, PolyPhobius, MEMSAT3, Philius, SCAMPI a, SPOCTOPUS, MEMSAT-SVM.

a Methods predict only TMHs, no option to predict SPs.

Shown are discrimination rates between soluble proteins (Sol) and TMPs and between signal peptides (SP) and TMHs. The datasets are assembled from SignalP4 training sets 49 : (i) three sets of soluble proteins without SPs (Sol): eukaryotic, Gram-positive, and Gram-negative bacteria, (ii) soluble proteins with SP (Sol(SP)), and (iii) a set of eukaryotic TMPs with SPs. Set sizes are given in the table in brackets. For soluble proteins, we consider as false positives all proteins with a predicted TMH and denote the false positive rate (FPR). For the TMP(SP) set the sensitivity is given. Rates and the sensitivity are given in percent (%) of all proteins of the respective set. Percentages are rounded to integers, except for values >99.5. For example, PolyPhobius misclassifies 62 of 1297 proteins in the eukaryotic Sol(SP) set, that is, FPR = 5%.
We weighted sequences to prefer smaller fractions of unmapped residues and longer sequences. We excluded the PDB chain 3n23:G (UniProt AC Q58K79) describing the Na+/K+-ATPase gamma subunit (annotated as TMP in PDBTM), because its pseudo-TMH is buried in a larger protein complex rather than contacting the lipid bilayer. This resulted in an independent test set of 190 TMPs (Supporting Information Table S1). PDBTM and OPM differ for several proteins, for example, in their annotations of the first (N-term) and last residues (C-term) in TMHs. We used both annotations by considering predictions correct if they matched either of the two. Although these databases might contain mistakes, we nevertheless prefer to use the systematic bias from consistent automated annotations as opposed to fully manual annotations, which introduce a different set of challenges.

Table III. Evaluation Scores

Name: $Q_{tmh}^{\%obs}$. Formula: number of correctly predicted TMHs / number of TMHs observed. Description: TMH recall.
Name: $Q_{tmh}^{\%pred}$. Formula: number of correctly predicted TMHs / number of TMHs predicted. Description: TMH precision.
Name: Qok (%). Formula: $Q_{ok} = \frac{100}{N_{prot}} \sum_{i=1}^{N_{prot}} \delta_i$, with $\delta_i = 1$ if $Q_{tmh}^{\%obs} = 1 = Q_{tmh}^{\%pred}$ for protein $i$, and $\delta_i = 0$ otherwise. Description: percentage of proteins for which all TMH positions are correctly predicted.
Name: FPR (%). Formula: 100 × false positives / (false positives + true negatives). Description: percentage of false positives among negatives.
Name: Sens (%). Formula: 100 × true positives / (true positives + false negatives). Description: percentage of detected positives.

Per-segment recall ($Q_{tmh}^{\%obs}$) and precision ($Q_{tmh}^{\%pred}$) are pooled for all TMHs of all proteins in the dataset and then averaged once. Qok, which combines recall and precision, is calculated for every protein. 24 False positive rate (FPR) and Sens(itivity) are employed for the scoring of discrimination between soluble proteins and TMPs.

As an exception, we manually edited annotations in six OPM entries and in one PDBTM entry, related to kinked and re-entrant helices as

detailed in Supporting Information Table S2. In addition to the TMH annotations from OPM and PDBTM, we used alpha-helix (H) annotations from DSSP (CMBI version, April 2010). 48 First, we extended the TMHs from OPM and PDBTM to the full length of the DSSP-assigned alpha-helix, thereby ignoring membrane boundaries. Second, we exclusively used alpha-helix annotations from DSSP, ignoring OPM and PDBTM.

TMP subsets

We separated the 190 TMPs into three subsets.

Subset 1, new vs. old: We BLASTed (e-value < 0.1) our TMP dataset against all datasets used for the development of the methods we analyzed. The sets for HMMTOP2 and PHDhtm had to be recreated according to the identifiers listed in the original publications. For HMMTOP2, UniProt ID tcr1_ecoli and, for PHDhtm, UniProt ID a1aa_human could not be mapped unambiguously and were excluded. In this way, we obtained a subset with 146 old TMPs (used for development) and 44 new TMPs (not used for development).

Subset 2, eukaryotic vs. bacterial: We separately analyzed performance for 75 eukaryotic and 104 bacterial TMPs. The remaining 11 TMPs were viral (2) and archaeal (9), too few to support further separation.

Subset 3, grouping by number of TMHs: We separated the TMPs by the number of TMHs annotated by OPM and PDBTM into three sets with 1 TMH, 2–5 TMHs, and >5 TMHs. For this, five entries (2lp1:A, 3t9n:E, 1jb0:F, 4hzu:S, and 4i9w:B) had to be excluded, since they would fall into different bins due to differing numbers of annotated TMHs (Supporting Information Table S1).

Human proteome

The complete human proteome with 20,258 sequences was retrieved from UniProtKB/Swiss-Prot (release 2014_04). Predictions were performed with PolyPhobius using the same settings as for all other predictions. A single entry (UniProt AC Q8WZ42, Titin) had to be excluded because it was too long for that method.
All statistics mentioned in Results and Discussion for the comparison against our dataset are calculated for 6016 proteins that are predicted to be TMPs.

Soluble non-TM proteins

Sets of soluble proteins without signal peptides were assembled from the homology-reduced SignalP4 data. 49 We excluded 12 proteins that could not be handled by all methods. Finally, we used three sets: a eukaryotic soluble set with 5106 proteins, a set from Gram-negative bacteria with 356 proteins, and another from Gram-positive bacteria with 911 proteins. The SignalP4 dataset has no entries from Archaea or viruses. For an additional comparison on the human proteome, the complete proteome of 20,258 sequences was used. Sequences were subjected to prediction with SignalP4.1, which resulted in the following estimates: 14,149 sequences soluble without signal peptide, 3220 soluble with signal peptide, and 2889 TMPs. Error rates from Table II were then applied to these sets. For example, Phobius has error rates of 3%, 4%, and 1% (99% sensitivity) for the previous cases on eukaryotic proteins. Applying these to the estimates on the proteome results in 424 errors on soluble proteins without signal peptides (3% of 14,149), 129 on soluble proteins with signal peptides, and 29 on TMPs. Overall, this suggests that approximately 2.9% of the human proteome is predicted wrongly.

Datasets with signal peptides

Signal peptides are typically residues long and mediate specific targeting of proteins to different cellular compartments. 50,51 The central section of a signal peptide is commonly highly hydrophobic, thereby resembling TMHs. Some signal peptides are, indeed, not cleaved off after reaching the target membrane but rather are embedded as a TMH. 52 Many TMH prediction methods predict signal peptides along with TMHs (Table I). We evaluated confusions between signal peptides and TMHs using the following sets of soluble proteins and TMPs with signal peptides. 49 Four signal peptide datasets were built from the SignalP4 data.
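The proteome-wide error estimate for Phobius described above is simple arithmetic and can be reproduced with a short sketch. The set sizes and error rates are the ones quoted in the text; the function name `expected_errors` and the dictionary keys are hypothetical labels chosen for illustration, and rounding each count to the nearest integer is an assumption.

```python
def expected_errors(set_sizes, error_rates):
    """Multiply each set size by its error rate and round to whole proteins."""
    return {name: round(set_sizes[name] * error_rates[name]) for name in set_sizes}


# SignalP4.1-based estimates for the human proteome (from the text).
sizes = {"soluble_no_sp": 14149, "soluble_sp": 3220, "tmp": 2889}
# Phobius error rates on eukaryotic sets (from the text): 3%, 4%, 1%.
rates = {"soluble_no_sp": 0.03, "soluble_sp": 0.04, "tmp": 0.01}

errors = expected_errors(sizes, rates)
total_errors = sum(errors.values())
percent_wrong = 100.0 * total_errors / sum(sizes.values())

print(errors, total_errors, round(percent_wrong, 1))
```

The three counts add up to 582 of 20,258 sequences, which is where the figure of roughly 2.9% comes from.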
Soluble proteins and TMPs containing a signal peptide were collected and then split into eukaryotic, Gram-negative, and Gram-positive entries. We excluded 10 proteins that could not be handled by all methods. The final sets of soluble proteins with signal peptides included 1297 eukaryotic, 400 Gram-negative, and 204 Gram-positive proteins (Table II). The TMP dataset with signal peptides contained 332 eukaryotic TMPs. There were too few bacterial TMPs with signal peptides (22 Gram-negative, 3 Gram-positive) to compile performance separately.

Evaluation measures

We assessed performance through the standard measures. We largely focused on the per-protein score Qok 24 (Table III). For this, a TMH counted as correctly predicted when both TMH endpoints differed by at most five residues between observation and prediction (Fig. 1A). In addition, any observed TMH matched maximally one predicted TMH and vice versa. This constraint implicitly handled the correct prediction of the number and placement of all TMHs simultaneously. We further calculated performance based only on OPM or PDBTM annotations and found performance to be higher and less deviating for OPM (mean Qok ) than for PDBTM (mean Qok ; data not shown). The above constraints, in particular the required overlap, yielded significantly more

conservative estimates than those provided in previous evaluations. Errors were estimated through bootstrapping. 53 As the standard version with putting the data back into the pot tends to underestimate the error, we used a version in which we randomly drew 1000 times half the dataset size (for example, 1000 sets with 95 TMPs each for assessing the error for the full dataset). The sample standard deviation was calculated as the square root of the sample variance over the 1000 draws. For the sets of soluble proteins, a prediction was considered as incorrect (FP: false positive) if any TMH was predicted.

Figure 1. Transmembrane helix prediction performance. Qok scores for all 12 prediction methods on various sets of TMPs. Qok denotes the percentage of proteins for which all TMHs were correctly predicted (A: TMH endpoints within five or fewer residues of either OPM or PDBTM annotation for the whole protein, Methods). Above the bars are the numbers of proteins in each dataset. Error bars are the sample standard deviation generated by bootstrapping with 1000 draws of half the set size each (cf. Methods). B: Qok is plotted for 190 redundancy-reduced TMPs followed by 44 new (not used for development) and 146 old (used for development, either the protein itself or homologous proteins) TMPs. All methods clearly performed worse for more recently determined protein structures. The old–new difference for TopPred2 suggested that a significant fraction of the differences might not be explained by over-training. C: All methods reached higher Qoks for eukaryotes than for bacteria. Note that we excluded the nine archaeal sequences and the two sequences of viral origin. D: Performance declines from bitopic TMPs to those with 2–5 TMHs or more. For (D), the number in brackets behind the set size denotes the number of TMHs in the respective subset.
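The Qok criterion and the half-sample error estimate from Evaluation measures can be sketched as follows. This is a minimal illustration, not the evaluation protocol itself: helices are assumed to be given as inclusive (first, last) residue pairs, the greedy one-to-one matching is an assumed implementation of "any observed TMH matched maximally one predicted TMH", drawing half the set without replacement is one reading of the resampling described in the text, and all function names are hypothetical.

```python
import random
import statistics


def count_matches(observed, predicted, tol=5):
    """Count one-to-one matches whose start and end both differ by <= tol."""
    used = set()  # indices of predicted helices already matched
    matches = 0
    for o_start, o_end in observed:
        for j, (p_start, p_end) in enumerate(predicted):
            if j in used:
                continue
            if abs(o_start - p_start) <= tol and abs(o_end - p_end) <= tol:
                used.add(j)
                matches += 1
                break
    return matches


def protein_ok(annotations, predicted, tol=5):
    """True if recall and precision are both 1 for at least one annotation.

    `annotations` holds one helix list per database (e.g. OPM and PDBTM).
    """
    for observed in annotations:
        if count_matches(observed, predicted, tol) == len(observed) == len(predicted):
            return True
    return False


def qok(ok_flags):
    """Percentage of proteins with all TMHs correctly predicted."""
    return 100.0 * sum(ok_flags) / len(ok_flags)


def half_sample_std(ok_flags, draws=1000, seed=0):
    """Sample standard deviation of Qok over draws of half the dataset."""
    rng = random.Random(seed)
    half = len(ok_flags) // 2
    scores = [qok(rng.sample(ok_flags, half)) for _ in range(draws)]
    return statistics.stdev(scores)


# Toy example with made-up coordinates: two annotations, one prediction.
opm = [(10, 30), (45, 66)]
pdbtm = [(12, 33), (44, 64)]
print(protein_ok([opm, pdbtm], [(13, 31), (47, 69)]))
```

Requiring the match count to equal both the number of observed and the number of predicted helices is what ties Qok to recall = 1 and precision = 1 in Table III.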
The calculation of the false positive rate (FPR), and of the sensitivity for the TMP datasets, is defined in Table III. For the sets with signal peptides, we additionally checked whether an incorrectly predicted TMH overlapped with an annotated signal peptide (FPR_SP). For this score, we tested six overlap thresholds between 1 and 20 residues. In the reported results, we considered an overlap of predicted TMH and observed signal peptide by more than 7 residues as incorrect.

Transmembrane helix prediction methods

The 12 prediction methods were chosen according to various criteria. We sampled some methods of historical importance for comparison and focused on the more recent methods that were readily available and had been claimed to perform at the top. The contenders were as follows.

No machine learning

TopPred2 43 uses hydrophobicity scales through a relatively simple sliding window approach and chooses the final topology based on the positive-inside rule. 19 SOSUI 40 uses the Kyte–Doolittle hydrophobicity scale. 54 Additionally, SOSUI employs an amphiphilicity index to account for weak and strong polar residues at the helix ends. 55

Neural networks (NN)

PHDhtm 35,56 was one of the first TMH prediction methods based on machine learning. It also was the first to improve performance through the use of evolutionary information from multiple sequence alignments. MEMSAT3 expanded this by also predicting signal peptides. 33 MEMSAT3 applies an external filter to distinguish between globular and membrane proteins. The most recent NN-based method in our evaluation was SPOCTOPUS. 41 SPOCTOPUS, which further develops OCTOPUS, 57 consists of several NNs for different residue types, one of which models re-entrant helices. It contains an additional NN with an adjunct hidden Markov model to improve discrimination between signal peptides and TMHs.

Figure 2. Transmembrane helix statistics. The plots show statistics for the original set of 1101 TMPs (plotted as dark grey bars in A–C) as well as the redundancy-reduced set of 190 TMPs (plotted as light grey bars in A–C). We consider annotations from OPM and PDBTM for the evaluation. Therefore, the statistics shown are based on both annotations as well. A: Distribution of the number of TMHs per protein. B: Distribution of TMH lengths, where the 1101 proteins in the original set contain 10,676 TMHs and the 190 proteins in the redundancy-reduced set contain 1148 TMHs. The average TMH length of 20 residues is indicated by the dotted line. C: Fraction of non-TM residues per protein. Statistics for the eukaryotic, bacterial, new, and old subsets can be found in Supporting Information Figures S1–S4.
Hidden Markov models (HMM)

HMMTOP2 32 was the second TMH prediction method based on HMMs (TMHMM was the first); it relies on the divergence in amino acid distribution between structural components of the protein. TMHMM2 42 is based on a cyclic HMM with modules for a helix core, capping regions, and loops on either side of a transmembrane helix, as well as a single state for globular sequence parts, that is, loops longer than 20 residues. A major limitation that was already pointed out by the authors in the original publication is the high FPR due to the confusion of signal peptides and TMHs. This shortcoming was tackled by Phobius, 37 which essentially combines the HMMs of TMHMM and the signal peptide model of SignalP-HMM. 58 PolyPhobius 38 additionally incorporates homology information through a new HMM decoder. SCAMPI 39 was developed with the premise that the biological machinery does not know about amino acid distributions and that successful predictions should be possible from physical principles. In contrast to the other machine learning-based methods that optimize many parameters, SCAMPI, although technically an HMM, only optimizes two parameters and bases the prediction on the estimated

free energy of membrane insertion, given a segment of 21 residues and based on a previously experimentally developed propensity scale.

Bayesian networks

Dynamic Bayesian networks can be considered generalizations of HMMs. We evaluated one method based on this concept, namely, Philius, 36 which uses a state transition diagram equal to that employed by Phobius.

Support vector machines (SVMs)

MEMSAT-SVM, the most recent method evaluated here, consists of five separate SVMs and incorporates homology information as well as discrimination between TMHs and signal peptides. It also treats re-entrant helices separately.

For all methods, we used the default parameters and ran the methods on our machines where stand-alone versions were available. For all others (SOSUI, Philius), the respective webservers were employed. As proposed by the authors, multiple sequence alignments for PolyPhobius were created using Kalign2. 59 All methods requiring a BLAST database were run on UniProtKB/Swiss-Prot release 2013_09. For tests regarding the effect of the database size, UniProtKB/Swiss-Prot release 2014_04 was used, as well as a database assembled from merging PDB, UniProtKB/Swiss-Prot, and TrEMBL of the same release, redundancy reduced with CD-HIT 60 at 98% pairwise sequence identity.

The base performance was estimated by three simple prediction methods. RANDOM first randomly predicted the number of TMHs (according to the TMH distribution in the full dataset). The randomly predicted TMHs were then inserted consecutively, nonoverlapping, with lengths according to a normal distribution approximating helix lengths in the dataset (μ , σ = 4.3). Another random prediction method, SEMIRANDOM, started with the correct number of TMHs for every protein, and then placed TMHs at random along the sequence. The third background method, HYDRO, used hydrophobicity values from the Eisenberg scale.
61 For a fixed window size of 21, the sum of hydrophobicity values was calculated at each position in the sequence. A helix was predicted for all windows with a sum > 4, beginning from the highest values until all TMHs were placed. Where two helices overlapped, the initial prediction was retained while all subsequent overlapping TMHs were removed.

RESULTS AND DISCUSSION

Assembling a curated dataset of high-resolution TMP structures

To assess the performance of transmembrane helix (TMH) prediction, we assembled a new dataset of TMPs. Our final curated and redundancy-reduced dataset contained 190 unique TMPs from the PDB (Supporting Information Table S1, available along with our evaluation protocol). We annotated the observed TMHs using OPM and PDBTM. 13,14 For 15 cases, the database annotations disagreed on the number of TMHs they annotated (Supporting Information Table S1). For these cases, prediction methods seem to favor the annotations provided by OPM (Supporting Information Table S3). Our TMP dataset contained 190 proteins with 1–14 TMHs (Fig. 2A). This distribution was similar before and after redundancy reduction. The largest differences were found for TMPs with one TMH (49% after redundancy reduction, 25% before). The distribution of TMH lengths (mean 20 residues) also did not change on redundancy reduction (Fig. 2B). The standard deviation in TMH length is slightly larger for structure-derived annotations than for predictions (Supporting Information Table S4). Six TMHs in our dataset were 10 residues long; five of the six came from OPM. The longest TMH, with 38 residues, is annotated by OPM and describes a very angular TMH (PDB ID 3tij:A, TMH 7). About 37% of all residues in TMPs were in TMHs, similar to the percentage before the redundancy reduction (Fig. 2C). The few cases with <10% TM residues were TMPs with only one or two TMHs. Statistics for the eukaryotic, bacterial, new, and old subsets can be found in Supporting Information Figures S1–S4.
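The HYDRO baseline described in Methods (window of 21 residues, sum of hydrophobicity values > 4, highest-scoring windows placed first, later overlapping windows discarded) can be sketched as below. This is one plausible reading of that description, not the implementation used for the evaluation. The scale values are the Eisenberg consensus values as commonly tabulated, which is an assumption about the exact variant used; the function name `hydro_predict`, the 0-based inclusive coordinates, and the tie-breaking by position are likewise illustrative choices.

```python
# Eisenberg consensus hydrophobicity values as commonly tabulated.
# Assumption: the exact variant of the scale used for HYDRO is not stated.
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def hydro_predict(seq, window=21, threshold=4.0, scale=EISENBERG):
    """Return predicted helices as sorted (start, end) pairs, 0-based inclusive."""
    candidates = []
    for start in range(len(seq) - window + 1):
        # Unknown residue codes contribute 0 (an illustrative choice).
        total = sum(scale.get(aa, 0.0) for aa in seq[start:start + window])
        if total > threshold:
            candidates.append((total, start))

    # Place the highest-scoring windows first; ties go to the earlier window.
    candidates.sort(key=lambda item: (-item[0], item[1]))

    placed = []
    for _, start in candidates:
        end = start + window - 1
        # Keep a window only if it overlaps none of the helices already placed.
        if all(end < p_start or start > p_end for p_start, p_end in placed):
            placed.append((start, end))
    return sorted(placed)


# Toy sequence: a 21-residue leucine stretch flanked by charged residues.
print(hydro_predict("K" * 30 + "L" * 21 + "D" * 30))
```

In the toy sequence only the window covering the full leucine stretch survives, because every other window above the threshold overlaps it.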
Performance estimated through stringent overlap criterion

Measured by two-state per-residue scores, that is, the fraction of residues correctly predicted as TMH or non-TMH, many prediction methods appeared to be very good (80% precision and recall), and also very similar (Supporting Information Table S5 and Figs. S5 and S6). This has been noted before and it is debatable whether experimental annotation is accurate enough to consider these measures meaningful. 27 Therefore, we measured performance through the Qok score, that is, the percentage of proteins for which all TMHs were correctly predicted (Table III). Typically, TMHs have been considered as correctly predicted when as few as 3–5 residues of the 20-residue-long TMH were matched. 24,34,38 This tends to overestimate performance. Since most TMH predictions are indeed rather reliable, we introduced a more stringent criterion: if the start and end points of a predicted TMH differ by at most n residues from the start and end points of an observed TMH, this helix is considered as correctly predicted (schematic in Fig. 1A). Although the method ranking changed only slightly depending on this parameter (we compared values n = 3–7), the actual values differed substantially: from average Qok = 28% for n = 3 to average Qok = 63% for n = 7. For the remainder, we

focused on results for n = 5. All 12 methods tested reached Qok > 40% (average Qok = 50%, Fig. 1B). Random predictions reached Qok = 5%; simple hydrophobicity-based predictions reached Qok = 38% (cf. RANDOM and HYDRO in Methods). When the OPM and PDBTM annotations differed, we used whichever yielded the more optimistic performance estimate for each prediction method and each TMP. Assessing OPM annotations like a prediction method against the annotations from PDBTM, and vice versa, resulted in an overall Qok near the lower-end performing methods (Qok = 44%; data not shown). Using alpha-helix annotations from DSSP instead of TMH annotations results in a significantly lower performance (Supporting Information Fig. S7: average Qok = 16% compared to 50% for TMH annotations). We also used DSSP helix annotations to extend the TMHs given by OPM and PDBTM to the full length of the respective alpha-helix. This extended 62% of all TMHs on the N-terminus and 59% on the C-terminus, both by five residues on average. Extended helices result in a drop in performance (Supporting Information Fig. S7: average Qok = 29%). These additional analyses underline the unique features of TMHs and the need for specialized prediction methods. Another result in the same direction was that a state-of-the-art method for the prediction of general secondary structure in proteins performed worse by all three measures (Supporting Information Fig. S7), that is, for the TMH annotation (ΔQok = 43 percentage points (pp) with respect to the best TMH prediction method), for DSSP elongated helices (ΔQok = 17 pp), and for DSSP alpha-helix annotation (ΔQok = 4 pp).

Error estimates remain challenging

Our TMP dataset contained only TMPs of known 3D structure. Although the number of TMPs with known 3D structure has grown substantially over the last decades, the resulting dataset is still small.
11 However, experimental TMH annotations from other biochemical assays tend to contain errors on a similar scale as prediction methods. 24 For the final 190 TMPs in our dataset, we estimated errors by bootstrapping. 53 Almost all 12 methods fell within one standard deviation interval of each other given those error estimates. However, this did not imply the differences to be statistically insignificant. 62

TMH predictions largely successful

TopPred2 was developed more than 20 years ago and still reached Qok = 48%, that is, almost the average performance level. More recent methods tended to perform slightly better (Qok > 50%). However, the differences were often insignificant. Two methods peaked at Qok > 55%: PolyPhobius performed best on the complete dataset with Qok = 60%, followed by MEMSAT-SVM (Qok = 57%). For more than 70% of the proteins, all methods correctly predicted the number of TMHs (Supporting Information Table S6 and Fig. S8). PolyPhobius (84%) and SCAMPI (83%) performed best. Methods tended to miss TMHs more often than to over-predict them. Most mistakes were minor: for >93% of the proteins the predicted number of TMHs was either correct or only off by one TMH (Supporting Information Table S6 and Fig. S8).

We applied PolyPhobius to the human proteome to estimate bias in our experimental dataset (Fig. 2) with respect to all proteins in a complete proteome. We observed only one major difference in the number of TMHs: PolyPhobius predicted 16% of the TMPs in the human proteome to have seven TMHs as opposed to only 3% in our dataset (data not shown). The average observed TMH in our dataset was 20 residues long; this was similar to the average of 22 residues for the human proteome predicted by PolyPhobius (cf. Supporting Information Table S4). The amount of non-TMH residues was higher for the human proteome: 25% of the proteins had % non-TMH residues and very few cases were predicted with <35% non-TMH residues.
New proteins predicted less accurately

Prediction performance decreased significantly for new proteins, that is, those that have no homologs in any of the methods' training sets (Fig. 1B). Since all prediction methods optimize some parameters, differences in performance between new (unused for training) and old (used for training) proteins might point to over-training. However, TopPred2 did not really train, and even for TopPred2 we observed a substantial difference in Qok of 25 pp (Fig. 1B, Qok(old) = 54% vs. Qok(new) = 29%). The lower performance for new proteins might therefore indicate that newer structures reveal new and complex structural features such as transmembrane segments that are bent or kinked. This view is supported by the performance of our hydrophobicity-based predictor HYDRO (cf. Methods), which behaved similarly to the other methods (ΔQok(old - new) = 27 pp), while the random predictor RANDOM remained unaffected (ΔQok = 3 pp). If true, differences significantly larger than 25 pp could be explained by over-training, which seemed more detrimental for new than for old methods: the average ΔQok(old - new) for the five methods published before 2002 (Table I) was 20 pp, while that for the seven newer methods was 30 pp. The exception was SCAMPI (ΔQok = 15 pp), a new method that optimized only two free parameters.

The subset of new proteins was too small to support clear conclusions. Nevertheless, we observed that the ranking of methods shifted: only SCAMPI reached higher Qok values than the other methods for the new structures. Given the error rates, this difference was within the range of one standard deviation (Fig. 1B), that is, we observed a trend, not a statistically significant difference.

Prediction performance higher for eukaryotic sequences

All methods performed better for eukaryotic than for bacterial proteins (Fig. 1C). This surprising new finding contradicted previous reports [24,25]. We found the smallest difference in Qok for MEMSAT-SVM and PHDhtm (ΔQok = 12 pp). As these two represent new and old methods, the improved prediction for eukaryotes seemed unaffected by over-training or novel TMP structures. Overall, PolyPhobius and MEMSAT-SVM stood out among the best for eukaryotes and bacteria.

Worse predictions for proteins with many TMHs

Performance tended to decrease for proteins with more TMHs (Fig. 1D). Numerically, all methods reached the highest values for proteins with a single TMH. First, all proteins in our experimental set had TMHs. Second, since most methods get the number of TMHs right most of the time (Supporting Information Fig. S8), the odds are much higher for proteins with a single TMH to be fully correct. At the same time, proteins with more TMHs can contain different signals and therefore pose an additional challenge for prediction. Qok dropped strongly from one TMH to two TMHs (Supporting Information Fig. S9). The same logic applied to the comparison between proteins with 2-5 and those with >5 TMHs. We found it impossible to separate the statistical effect from a possible inherent characteristic of polytopic TMPs. The reason was that different random models gave different answers to the question of whether or not proteins with one TMH were predicted better (Supporting Information Figs. S10 and S11). A similar correlation between the number of TMHs and Qok has been observed before [24]. Possibly, performance decreases because bitopic TMPs are, on average, more hydrophobic than polytopic TMPs [63]. This hypothesis is supported by the methods that arguably rely most on hydrophobicity: TopPred2 and our simple hydrophobicity-based prediction. Their difference in Qok (ΔQok = 47 pp and 51 pp, respectively) is among the largest between bitopic (Qok(TMHs = 1)) and polytopic (Qok(TMHs = 2-5) + Qok(TMHs > 5)) TMPs. We observed the lowest difference for the newest method (MEMSAT-SVM, ΔQok = 24 pp).

Re-entrant helices did not significantly impact our overall results

Recently determined structures shed light on the prevalence of an additional type of membrane helix, namely re-entrant helices that do not cross the membrane but instead enter and exit on the same side. Here, we treated re-entrant regions as TMHs throughout, since only two of the evaluated methods (SPOCTOPUS and MEMSAT-SVM) distinguish re-entrant from transmembrane helices. However, we investigated a subset of 175 proteins without re-entrant helices. For this subset, performance appeared slightly higher (average Qok = 53% compared to 50% for the full set). The ranking of the methods was not significantly affected (data not shown).

Discrimination between soluble proteins and TMPs clearly sets methods apart

Up to this point, our analysis focused on proteins known to contain TMHs, that is, we have provided estimates that hold for about one-fourth of all proteins in most organisms. Do the same methods also correctly recognize the other three-fourths as non-TMPs, that is, as soluble proteins? Many proteins entering the secretory pathway have signal peptides that resemble TMHs (and are occasionally also embedded as TMHs) [50,51]. More recent methods account for this phenomenon (Table I). However, older methods that do not account for signal peptides often confuse these with TMHs (Table I). Therefore, we assessed nonmembrane proteins with and without signal peptides separately. For proteins without signal peptides, TMHMM2 and Philius appeared best at distinguishing soluble proteins and TMPs (FPR < 2%, Table II).
For proteins with signal peptides, the FPR for TMHMM2 shot up (>20%) because it does not account for signal peptides. In contrast, Philius still remained below 3% error except for proteins with signal peptides from Gram-positive bacteria, for which the error increased to 12% (which was still the best performance, followed by PolyPhobius, SPOCTOPUS, and Phobius; Table II). Phobius, PolyPhobius, and Philius performed best overall for proteins with and without signal peptides. Similar conclusions have been suggested previously [28]. On the flip side, Philius had the lowest sensitivity in detecting TMPs, that is, its power in recognizing proteins without TMHs came at the price of missing many TMPs.

To put Table II in the context of an entire organism, we applied the FPRs and sensitivity for eukaryotic proteins to the complete human proteome (Methods). Based on this estimate, Phobius and Philius gave the best compromise between missing TMPs (Table II: sensitivity) and over-predicting TMPs (Table II: FPRs for soluble eukaryotic proteins with and without signal peptides). Both predict 2.9% of the proteome incorrectly as either TMPs or soluble proteins, followed by TMHMM2 (4%) and PolyPhobius (6%). Given the good PolyPhobius prediction of the TMH locations (Fig. 1) and the very fast runtime of Phobius, it might be an option to run Phobius first to find TMPs and PolyPhobius afterward to predict where the TMHs are.
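The proteome-wide estimate described above combines sensitivity and the two false positive rates into a single fraction of misclassified proteins. A minimal sketch of that arithmetic follows, assuming the proteome is split into three classes (TMPs, soluble proteins with signal peptides, soluble proteins without). The function name and all numeric inputs are hypothetical placeholders, not values from Table II, and the calculation in Methods may differ in detail.

```python
def proteome_error(frac_tmp, frac_sp_soluble, sensitivity, fpr_no_sp, fpr_sp):
    """Fraction of a proteome classified incorrectly as TMP or soluble.

    frac_tmp:        fraction of proteins that are TMPs
    frac_sp_soluble: fraction that are soluble and carry a signal peptide
    sensitivity:     fraction of TMPs recognized as TMPs
    fpr_no_sp:       false positive rate for soluble proteins without signal peptide
    fpr_sp:          false positive rate for soluble proteins with signal peptide
    """
    frac_plain_soluble = 1.0 - frac_tmp - frac_sp_soluble
    missed_tmps = frac_tmp * (1.0 - sensitivity)
    false_tmps_no_sp = frac_plain_soluble * fpr_no_sp
    false_tmps_sp = frac_sp_soluble * fpr_sp
    return missed_tmps + false_tmps_no_sp + false_tmps_sp


# Hypothetical inputs for illustration only.
error = proteome_error(
    frac_tmp=0.25,          # "about one-fourth of all proteins"
    frac_sp_soluble=0.15,   # assumed share of secreted soluble proteins
    sensitivity=0.95,
    fpr_no_sp=0.02,
    fpr_sp=0.05,
)
print(f"misclassified fraction: {error:.3f}")
```

Because soluble proteins are the majority class, even a small FPR contributes about as much to the total error as a few percent of missed TMPs, which is why a method with low sensitivity but very low FPR can still rank well on this measure.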

Several of the older methods that had access to significantly less data during development than more recent methods were among the top performers in distinguishing between TMPs and soluble proteins without signal peptides, namely SOSUI (from 1998) and TMHMM2 (2001), with average FPRs < 2%. On the other hand, very recent methods, using the most up-to-date datasets, did not necessarily distinguish well between TMPs and soluble proteins, as exemplified by MEMSAT-SVM. Among the methods that do not account for signal peptides, TopPred2 and SCAMPI stood out as very poor filters: they incorrectly predicted at least one TMH for almost every soluble protein with a signal peptide. In both cases, most misclassifications originated from mistaking a signal peptide for a TMH (Supporting Information Table S7: FPR(SP)). In fact, SCAMPI was developed on a dataset of soluble proteins and TMPs that did not contain signal peptides [39]. As suggested before [49], MEMSAT3 consistently performed worst among the methods accounting for signal peptides.

For most methods, signal peptides in proteins from Gram-positive bacteria constituted the toughest challenge. This difference might be explained by the fact that signal peptides in Gram-positives are longer and more hydrophobic than those in Gram-negatives [58]. However, many of the mistakes in Gram-positives could not be attributed to confusing signal peptides with TMHs (directly measured by FPR(SP)). In fact, most mistakes (FPR) were made outside the regions of the signal peptides (Supporting Information Table S7: FPR(SP) much smaller than FPR). The differences between eukaryotic and Gram-negative proteins were similar, albeit less substantial, yielding the overall tendency for the FPR in proteins with signal peptides: Gram-positives > eukaryotes > Gram-negatives.
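The distinction between FPR and FPR(SP) used above can be made concrete with a short sketch. It assumes that a soluble protein counts as a false positive if at least one TMH is predicted, and that it additionally counts toward FPR(SP) if a predicted TMH overlaps the annotated signal peptide. This reading, the function name, and the input format are assumptions for illustration; the exact definition is given in Supporting Information Table S7.

```python
def fpr_and_fpr_sp(proteins):
    """Return (FPR, FPR(SP)) for soluble proteins with signal peptides.

    proteins: list of (sp_end, predicted_tmhs), where sp_end is the
    1-based last residue of the signal peptide and predicted_tmhs is a
    list of (start, end) residue ranges (hypothetical input format).
    """
    false_positives = 0
    false_positives_sp = 0
    for sp_end, predicted_tmhs in proteins:
        if not predicted_tmhs:
            continue  # correctly predicted as soluble
        false_positives += 1
        # The signal peptide spans residues 1..sp_end, so a predicted TMH
        # overlaps it whenever the TMH starts at or before sp_end.
        if any(start <= sp_end for start, _end in predicted_tmhs):
            false_positives_sp += 1
    n = len(proteins)
    return false_positives / n, false_positives_sp / n


# Toy data, not taken from the paper.
toy = [
    (22, []),            # no TMH predicted
    (25, [(5, 24)]),     # signal peptide mistaken for a TMH
    (20, [(60, 80)]),    # false TMH outside the signal peptide
    (18, []),            # no TMH predicted
]
fpr, fpr_sp = fpr_and_fpr_sp(toy)
print(fpr, fpr_sp)
```

An FPR(SP) much smaller than the FPR, as reported for Gram-positive proteins, then indicates that most false TMHs lie outside the signal peptide region.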
Some methods are more CPU intensive than others

The 12 TMH prediction methods could be separated into three groups according to their runtime (Supporting Information Table S8). Fast methods (TopPred2, SCAMPI, TMHMM2, HMMTOP2, and Phobius) predicted >100 proteins per minute on our machines (single core, AMD Opteron 2431, Supporting Information Table S8). Slow methods (PHDhtm, PolyPhobius, MEMSAT3, and SPOCTOPUS) predicted 4-11 proteins per minute, and the slowest method (MEMSAT-SVM) took several minutes for one protein.

Effect of larger sequence database is ambiguous

Using evolutionary information as input can improve predictions [6,35,38]. For the five evaluated methods that use such information (cf. Table I), we compared performance when generating alignments from the small Swiss-Prot database (545,000 proteins, release 2014_04) and from a larger database based on the complete UniProtKB merged with PDB, redundancy reduced at 98% sequence identity (21.7 million proteins, release 2014_04, Methods). The effect varied, with MEMSAT-SVM benefiting most from the 40 times larger database (ΔQok > 5 pp). PHDhtm and MEMSAT3 remained largely unaffected (ΔQok pp), while PolyPhobius performed slightly worse (ΔQok pp). For the subset of new proteins, SPOCTOPUS and MEMSAT3 reached Qok = 34%, while the other three methods reached Qok = 31%. Overall, the reasons for these observations remain unclear. One issue could be that on such large databases the number of returned hits exceeds the number of hits expected during development of the method, so that the majority of them are discarded, resulting in a set of low variance. Since the search parameters that the methods use are in nearly all cases not easily customizable, and since the larger database also significantly increased the runtime, we recommend using a small database. The prediction results presented in this work were based on alignments built from Swiss-Prot.
CONCLUSION

We have evaluated how well 12 transmembrane helix (TMH) prediction methods correctly identify all experimentally observed TMHs and how well they distinguish between proteins with and without TMHs (Table I). The analysis used a newly created dataset of 190 nonredundant high-resolution alpha-helical TMPs. Forty-four of these 190 proteins were not sequence-similar to any protein used for development of the 12 methods. For our analysis, we used a performance measure that reflects the fraction of proteins in a dataset for which all TMHs are correctly predicted according to the annotations of either OPM or PDBTM (Qok). Further, we introduced more stringent criteria than previous scores (difference between predicted and observed TMH endpoints maximally five residues, Qok, Table III) to account for the fact that subsequent experimental and computational methods based on TMH predictions often need precise membrane boundaries [64,65]. A global score such as Qok with our new stringent criteria was crucial, because traditional per-residue scores suggested unrealistically high performance (Supporting Information Figs. S5 and S6). When combining several aspects of performance and runtime, PolyPhobius appeared to be the best method, followed by MEMSAT-SVM (Fig. 1).

Our observation that performance dropped for many methods when evaluated on proteins for which structures were published after the method development once again highlighted the continued problem of over-training. Conversely, some of this drop could be explained by the fact that newly determined structures contain important novel information that will be needed to improve prediction methods of the future. Earlier analyses suggested lower performance for eukaryotic TMPs than for bacterial TMPs [25,33]. Our result, in contrast, suggested the opposite: performance was lower for bacterial than for eukaryotic TMPs (Fig. 1C).

We confirmed that the best methods accurately distinguish between proteins with and without TMHs (Table II). Problems arose when testing soluble proteins with signal peptides. But even for those, at least for eukaryotes and Gram-positive bacteria, the best methods (Phobius and PolyPhobius) maintained error rates below 10% (Table II). Interestingly, the incorporation of evolutionary information did not help much in this respect (PolyPhobius adds evolutionary information to Phobius and performs slightly worse, Table II). In contrast, all methods performing best in correctly predicting TMHs used evolutionary information. The fact that some methods need much less CPU resources than others (differences of over two orders of magnitude) complicated the overview even more.

We did not explicitly evaluate re-entrant helices. These are predicted by only two of the 12 methods, and we showed that our overall results did not change significantly for the subset of proteins without re-entrant helices. However, the growth in experimental information about re-entrant helices and their importance for topology prediction should encourage future method developers to account for this important type of membrane helix more explicitly.
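The Qok measure summarized in this Conclusion can be sketched in a few lines. The sketch assumes a simplified reading of the criteria: a protein counts as correct only if the number of predicted TMHs equals the number of observed TMHs and every start and end point deviates by at most five residues, and the more favorable of the alternative annotations (for example OPM or PDBTM) is used. The function names and data layout are hypothetical, and Table III may specify additional conditions that are not modeled here.

```python
def helices_match(observed, predicted, max_shift=5):
    """True if both lists contain the same number of helices and all
    endpoints differ by at most max_shift residues."""
    if len(observed) != len(predicted):
        return False
    pairs = zip(sorted(observed), sorted(predicted))
    return all(
        abs(o_start - p_start) <= max_shift and abs(o_end - p_end) <= max_shift
        for (o_start, o_end), (p_start, p_end) in pairs
    )


def q_ok(dataset, max_shift=5):
    """Percentage of proteins with all TMHs predicted correctly.

    dataset: list of (annotations, predicted); annotations is a list of
    alternative observed helix lists, and a protein counts as correct if
    the prediction matches any one of them.
    """
    correct = sum(
        any(helices_match(obs, predicted, max_shift) for obs in annotations)
        for annotations, predicted in dataset
    )
    return 100.0 * correct / len(dataset)


# Toy data, not taken from the paper.
dataset = [
    # Two helices, both within five residues of the first annotation.
    ([[(10, 30), (45, 65)], [(12, 31), (44, 66)]], [(8, 27), (47, 68)]),
    # One helix observed, two predicted: wrong count.
    ([[(5, 25)]], [(5, 25), (40, 60)]),
    # Start point off by seven residues: too far.
    ([[(100, 120)]], [(107, 126)]),
    # Matches only the second annotation.
    ([[(20, 40)], [(26, 44)]], [(27, 46)]),
]
print(q_ok(dataset))
```

A single misplaced endpoint makes the whole protein count as wrong, which is why this per-protein score is far stricter than per-residue accuracy.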
Another open problem was underlined by the comparisons between the TMH and the DSSP annotations, and between methods optimized to predict alpha-helices in general and TMHs: we do not have a tool that combines TMH prediction with general secondary structure prediction in one model, although many proteins contain both. Our analysis was limited by several constraints, most importantly by the size of the dataset. Nevertheless, we answered our initial question by providing a good impression of which method to use for which task. Finally, we concluded that future improvements are both feasible and needed.

ACKNOWLEDGMENTS

The authors thank Tim Karl for technical assistance and Marlena Drabik for administrative support. They particularly thank Marco Punta for many helpful discussions. They also thank the anonymous reviewers for helpful suggestions and all authors who made their methods openly available and provided versions to run on our own machines. Last but not least, the authors thank all who practice open science and deposit their data into public databases and those who maintain these excellent databases.

REFERENCES

1. Bjarnadottir TK, Gloriam DE, Hellstrand SH, Kristiansson H, Fredriksson R, Schioth HB. Comprehensive repertoire and phylogenetic analysis of the G protein-coupled receptors in human and mouse. Genomics 2006;88:
2. Pieper U, Schlessinger A, Kloppmann E, Chang GA, Chou JJ, Dumont ME, Fox BG, Fromme P, Hendrickson WA, Malkowski MG, Rees DC, Stokes DL, Stowell MHB, Wiener MC, Rost B, Stroud RM, Stevens RC, Sali A. Coordinating the impact of structural genomics on the human alpha-helical transmembrane proteome. Nat Struct Mol Biol 2013;20:
3. Jacoby E, Bouhelal R, Gerspacher M, Seuwen K. The 7 TM G-protein-coupled receptor target family. ChemMedChem 2006;1:
4. Overington JP, Al-Lazikani B, Hopkins AL. How many drug targets are there? Nat Rev Drug Discov 2006;5:
5. Kessel A, Ben-Tal N. Introduction to proteins. London, UK: CRC Press;
6. Rost B, Fariselli P, Casadio R. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996;5:
7. Liu J, Rost B. Comparing function and structure between entire proteomes. Protein Sci 2001;10:
8. Fagerberg L, Jonasson K, von Heijne G. Prediction of the human membrane proteome. Proteomics 2010;10:
9. Stevens TJ, Arkin IT. Do more complex organisms have a greater proportion of membrane proteins in their genomes? Proteins 2000;39:
10. White SH. The progress of membrane protein structure determination. Protein Sci 2004;13:
11. Kloppmann E, Punta M, Rost B. Structural genomics plucks high-hanging membrane proteins. Curr Opin Struct Biol 2012;22:
12. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000;28:
13. Tusnady GE, Dosztanyi Z, Simon I. PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res 2005;33(Database issue):D275-D
14. Lomize MA, Pogozheva ID, Joo H, Mosberg HI, Lomize AL. OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res 2012;40(Database issue):D370-D
15. Jayasinghe S, Hristova K, White SH. MPtopo: a database of membrane protein topology. Protein Sci 2001;10:
16. Arnold K, Kiefer F, Kopp J, Battey JND, Podvinec M, Westbrook JD, Berman HM, Bordoli L, Schwede T. The protein model portal. J Struct Funct Genomics 2009;10:
17. Yarov-Yarovoy V, Schonbrun J, Baker D. Multipass membrane protein structure prediction using Rosetta. Proteins 2006;62:
18. Goldberg T, Hamp T, Rost B. LocTree2 predicts localization for all domains of life. Bioinformatics 2012;28:i458-i
19. von Heijne G, Gavel Y. Topogenic signals in integral membrane proteins. Eur J Biochem 1988;174:
20. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292:
21. Yachdav G, Kloppmann E, Kajan L, Hecht M, Goldberg T, Hamp T, Hönigschmid P, Schafferhans A, Roos M, Bernhofer M, Richter L, Ashkenazy H, Punta M, Schlessinger A, Bromberg Y, Schneider R, Vriend G, Sander C, Ben-Tal N, Rost B. PredictProtein - an open resource for online prediction of protein structural and functional features. Nucleic Acids Res 2014;42(Web Server issue):W337-W
22. Rost B. PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol 1996;266:


More information

Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches

Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches PROTEINS: Structure, Function, and Bioinformatics 58:166 179 (2005) Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches John-Marc Chandonia 1

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction Institute of Bioinformatics Johannes Kepler University, Linz, Austria Chapter 4 Protein Secondary

More information

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES Eser Aygün 1, Caner Kömürlü 2, Zafer Aydin 3 and Zehra Çataltepe 1 1 Computer Engineering Department and 2

More information

Predictors (of secondary structure) based on Machine Learning tools

Predictors (of secondary structure) based on Machine Learning tools Predictors (of secondary structure) based on Machine Learning tools Predictors of secondary structure 1 Generation methods: propensity of each residue to be in a given conformation Chou-Fasman 2 Generation

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Protein Structure Prediction

Protein Structure Prediction Page 1 Protein Structure Prediction Russ B. Altman BMI 214 CS 274 Protein Folding is different from structure prediction --Folding is concerned with the process of taking the 3D shape, usually based on

More information

Better Bond Angles in the Protein Data Bank

Better Bond Angles in the Protein Data Bank Better Bond Angles in the Protein Data Bank C.J. Robinson and D.B. Skillicorn School of Computing Queen s University {robinson,skill}@cs.queensu.ca Abstract The Protein Data Bank (PDB) contains, at least

More information

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy Burkhard Rost and Chris Sander By Kalyan C. Gopavarapu 1 Presentation Outline Major Terminology Problem Method

More information

Supporting online material

Supporting online material Supporting online material Materials and Methods Target proteins All predicted ORFs in the E. coli genome (1) were downloaded from the Colibri data base (2) (http://genolist.pasteur.fr/colibri/). 737 proteins

More information

Prediction. Emily Wei Xu. A thesis. presented to the University of Waterloo. in fulfillment of the. thesis requirement for the degree of

Prediction. Emily Wei Xu. A thesis. presented to the University of Waterloo. in fulfillment of the. thesis requirement for the degree of The Use of Internal and External Functional Domains to Improve Transmembrane Protein Topology Prediction by Emily Wei Xu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications, Cvicek et al. Supporting Text 1 Here we compare the GRoSS alignment

More information

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Prof. Dr. M. A. Mottalib, Md. Rahat Hossain Department of Computer Science and Information Technology

More information

A Genetic Algorithm to Enhance Transmembrane Helices Prediction

A Genetic Algorithm to Enhance Transmembrane Helices Prediction A Genetic Algorithm to Enhance Transmembrane Helices Prediction Nazar Zaki Intelligent Systems Faculty of Info. Technology UAEU, Al Ain 17551, UAE nzaki@uaeu.ac.ae Salah Bouktif Software Development Faculty

More information

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting. Genome Annotation Bioinformatics and Computational Biology Genome Annotation Frank Oliver Glöckner 1 Genome Analysis Roadmap Genome sequencing Assembly Gene prediction Protein targeting trna prediction

More information

Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling

Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling 63:644 661 (2006) Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling Brajesh K. Rai and András Fiser* Department of Biochemistry

More information

A NEURAL NETWORK METHOD FOR IDENTIFICATION OF PROKARYOTIC AND EUKARYOTIC SIGNAL PEPTIDES AND PREDICTION OF THEIR CLEAVAGE SITES

A NEURAL NETWORK METHOD FOR IDENTIFICATION OF PROKARYOTIC AND EUKARYOTIC SIGNAL PEPTIDES AND PREDICTION OF THEIR CLEAVAGE SITES International Journal of Neural Systems, Vol. 8, Nos. 5 & 6 (October/December, 1997) 581 599 c World Scientific Publishing Company A NEURAL NETWORK METHOD FOR IDENTIFICATION OF PROKARYOTIC AND EUKARYOTIC

More information

Homology models of the tetramerization domain of six eukaryotic voltage-gated potassium channels Kv1.1-Kv1.6

Homology models of the tetramerization domain of six eukaryotic voltage-gated potassium channels Kv1.1-Kv1.6 Homology models of the tetramerization domain of six eukaryotic voltage-gated potassium channels Kv1.1-Kv1.6 Hsuan-Liang Liu* and Chin-Wen Chen Department of Chemical Engineering and Graduate Institute

More information

CS612 - Algorithms in Bioinformatics

CS612 - Algorithms in Bioinformatics Fall 2017 Databases and Protein Structure Representation October 2, 2017 Molecular Biology as Information Science > 12, 000 genomes sequenced, mostly bacterial (2013) > 5x10 6 unique sequences available

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information

Public Database 의이용 (1) - SignalP (version 4.1)

Public Database 의이용 (1) - SignalP (version 4.1) Public Database 의이용 (1) - SignalP (version 4.1) 2015. 8. KIST 이철주 Secretion pathway prediction ProteinCenter (Proxeon Bioinformatics, Odense, Denmark; http://www.cbs.dtu.dk/services) SignalP (version 4.1)

More information

Bioinformatics: Secondary Structure Prediction

Bioinformatics: Secondary Structure Prediction Bioinformatics: Secondary Structure Prediction Prof. David Jones d.jones@cs.ucl.ac.uk LMLSTQNPALLKRNIIYWNNVALLWEAGSD The greatest unsolved problem in molecular biology:the Protein Folding Problem? Entries

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

The TOPCONS webserver for consensus prediction of membrane protein topology and signal peptides

The TOPCONS webserver for consensus prediction of membrane protein topology and signal peptides The TOPCONS webserver f consensus prediction of membrane protein topology and signal peptides Konstantinos D. Tsirigos 1,2, Christoph Peters 1,2, Nanjiang Shu 1,2,3, Lukas Käll 1,2 and Arne Elofsson 1,2,*

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH Ashutosh Kumar Singh 1, S S Sahu 2, Ankita Mishra 3 1,2,3 Birla Institute of Technology, Mesra, Ranchi Email: 1 ashutosh.4kumar.4singh@gmail.com,

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare

More information

Protein Secondary Structure Prediction using Pattern Recognition Neural Network

Protein Secondary Structure Prediction using Pattern Recognition Neural Network Protein Secondary Structure Prediction using Pattern Recognition Neural Network P.V. Nageswara Rao 1 (nagesh@gitam.edu), T. Uma Devi 1, DSVGK Kaladhar 1, G.R. Sridhar 2, Allam Appa Rao 3 1 GITAM University,

More information

Analysis of N-terminal Acetylation data with Kernel-Based Clustering

Analysis of N-terminal Acetylation data with Kernel-Based Clustering Analysis of N-terminal Acetylation data with Kernel-Based Clustering Ying Liu Department of Computational Biology, School of Medicine University of Pittsburgh yil43@pitt.edu 1 Introduction N-terminal acetylation

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Protein Structure Prediction using String Kernels. Technical Report

Protein Structure Prediction using String Kernels. Technical Report Protein Structure Prediction using String Kernels Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159

More information

A hidden Markov model for predicting transmembrane helices in protein sequences

A hidden Markov model for predicting transmembrane helices in protein sequences Procedings of ISMB 6, 1998, pages 175-182 A hidden Markov model for predicting transmembrane helices in protein sequences Erik L.L. Sonnhammer National Center for Biotechnology Information Building 38A,

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

BIOINFORMATICS LAB AP BIOLOGY

BIOINFORMATICS LAB AP BIOLOGY BIOINFORMATICS LAB AP BIOLOGY Bioinformatics is the science of collecting and analyzing complex biological data. Bioinformatics combines computer science, statistics and biology to allow scientists to

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Prediction of signal peptides and signal anchors by a hidden Markov model

Prediction of signal peptides and signal anchors by a hidden Markov model In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 122-13. AAAI Press, 1998. 1 Prediction of signal peptides and signal anchors by a hidden Markov model Henrik

More information

Review. Membrane proteins. Membrane transport

Review. Membrane proteins. Membrane transport Quiz 1 For problem set 11 Q1, you need the equation for the average lateral distance transversed (s) of a molecule in the membrane with respect to the diffusion constant (D) and time (t). s = (4 D t) 1/2

More information

Protein Structure Prediction Using Neural Networks

Protein Structure Prediction Using Neural Networks Protein Structure Prediction Using Neural Networks Martha Mercaldi Kasia Wilamowska Literature Review December 16, 2003 The Protein Folding Problem Evolution of Neural Networks Neural networks originally

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Protein Structures: Experiments and Modeling. Patrice Koehl

Protein Structures: Experiments and Modeling. Patrice Koehl Protein Structures: Experiments and Modeling Patrice Koehl Structural Bioinformatics: Proteins Proteins: Sources of Structure Information Proteins: Homology Modeling Proteins: Ab initio prediction Proteins:

More information

Supplementary Materials for mplr-loc Web-server

Supplementary Materials for mplr-loc Web-server Supplementary Materials for mplr-loc Web-server Shibiao Wan and Man-Wai Mak email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk June 2014 Back to mplr-loc Server Contents 1 Introduction to mplr-loc

More information

Protein quality assessment

Protein quality assessment Protein quality assessment Speaker: Renzhi Cao Advisor: Dr. Jianlin Cheng Major: Computer Science May 17 th, 2013 1 Outline Introduction Paper1 Paper2 Paper3 Discussion and research plan Acknowledgement

More information

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University Department of Chemical Engineering Program of Applied and

More information

-max_target_seqs: maximum number of targets to report

-max_target_seqs: maximum number of targets to report Review of exercise 1 tblastn -num_threads 2 -db contig -query DH10B.fasta -out blastout.xls -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length nident pident evalue" Other options: -max_target_seqs:

More information

Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments

Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments Improving Protein Secondary-Structure Prediction by Predicting Ends of Secondary-Structure Segments Uros Midic 1 A. Keith Dunker 2 Zoran Obradovic 1* 1 Center for Information Science and Technology Temple

More information

We used the PSI-BLAST program (http://www.ncbi.nlm.nih.gov/blast/) to search the

We used the PSI-BLAST program (http://www.ncbi.nlm.nih.gov/blast/) to search the SUPPLEMENTARY METHODS - in silico protein analysis We used the PSI-BLAST program (http://www.ncbi.nlm.nih.gov/blast/) to search the Protein Data Bank (PDB, http://www.rcsb.org/pdb/) and the NCBI non-redundant

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information