SUPPLEMENTARY MATERIALS

Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles

Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller
Contact: jmeller@chmcc.org; server available at http://minnou.cchmc.org

SP1. Multistage protocol for transmembrane helix prediction in MINNOU

In addition to the simple NN-based classifier developed and assessed in the cross-validation study (see the Systems and Methods section of the main body of the paper), we also developed a multistage protocol for enhanced prediction of transmembrane (TM) helices. The final predictor does not use the MA-based representation, which cross-validation showed to yield lower accuracy than the proposed compact prediction-based structural profiles. We also excluded hydropathy scales from the representation of an amino acid residue, as they were observed to lead to a higher level of confusion with globular proteins and signal peptides. Moreover, statistical propensities of amino acids (e.g., with respect to lipid-aqueous phase interfaces), which could in principle be used to improve the prediction of TM segments, are not taken into account at this stage. Consequently, each residue is initially represented by five numbers: the predicted real-valued RSA, the confidence of the RSA prediction, and the probabilities of each of the three secondary structure states (as predicted by SABLE, http://sable.cchmc.org). SABLE predictions are derived from multiple alignments, hydropathy scales and other attributes that are commonly used by other state-of-the-art methods. Therefore, other accurate methods for RSA and SS prediction are expected to be useful in this regard as well; this is the subject of a future investigation.

Following in the footsteps of other studies (Rost et al., 1995), we use a two-stage prediction system, with the second layer (structure-to-structure) NNs allowing one to average and smooth the initial classification obtained from the first (sequence-to-structure) layer. The architecture of the first and second layer NNs is similar to that used in the cross-validation study: a simple feed-forward topology with one hidden layer, fully interconnected with the input and output layers. The choice of the sliding window size, the number of nodes in the hidden layer, the training protocols and other characteristics of these NNs are discussed below.

To reduce the danger of overfitting and achieve regularization as well as an improvement in accuracy, a consensus of twenty different networks was used to generate predictions at each stage. These networks were trained on different subsets of the training set, namely those also used for the cross-validation study. Multiple NNs were trained on each subset of the data, with different numbers of nodes in the hidden layer (again varied between 8 and 18) and different sizes of the sliding window (varied between 11 and 31). The stopping criteria were defined in terms of improvement (or lack thereof) on the corresponding validation set. From the multiple networks trained on each subset of the data, the one providing the highest per-residue accuracy on the corresponding validation set was selected for the final consensus predictor. In that sense, the whole training set of 73 protein chains is used here to optimize the parameters of each of the networks. However, using different subsets of the data provides a better sampling of local minima.
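To make the input representation concrete, the following minimal sketch (in Python) builds the sliding-window feature vector for a single residue from the five per-residue attributes described above. The array names, the layout of the SABLE outputs and the zero-padding at chain termini are our assumptions for illustration, not part of the MINNOU package.

```python
import numpy as np

def window_features(rsa, rsa_conf, ss_probs, center, half_window=7):
    """Build the NN input for one residue: five attributes (predicted
    real-valued RSA, RSA prediction confidence, and three secondary
    structure probabilities) for every position in a sliding window
    of size 2*half_window + 1 centered on `center`.

    Positions falling outside the chain are zero-padded here; the
    actual padding scheme used by MINNOU is not specified in the text.
    """
    n = len(rsa)
    feats = []
    for i in range(center - half_window, center + half_window + 1):
        if 0 <= i < n:
            feats.extend([rsa[i], rsa_conf[i], *ss_probs[i]])
        else:
            feats.extend([0.0] * 5)   # padding assumption
    return np.asarray(feats)          # length = 5 * window size
```

For a window of 15 residues (half_window=7) this yields 75 input values per residue, consistent with the five-attributes-per-position representation.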
The first ten networks were trained on different subsets of the data using the representation with five attributes per amino acid residue, as described above. The other ten networks included in the consensus were trained with only four attributes per residue: the two nodes related to the relative solvent accessibility prediction are replaced by a single node representing a discretized RSA assignment. In other words, this binary node indicates whether a given residue is predicted to be buried or exposed (with a threshold of 25% RSA used to project the real-valued SABLE prediction into two classes). Adding these networks to the consensus was found to somewhat improve the per-segment classification accuracy. The consensus predictor is based on simple majority voting. The source code for the MINNOU package, with the definitions of all the NNs used for the consensus prediction (including a detailed description of their topology and parameters), can be downloaded from ftp://ftp.chmcc.org/pdi/jmeller/minnou/.

To further improve on the performance of the first layer classifier, we introduced a second layer prediction system. The output of the first layer is used at this stage as input; in addition, the SABLE predictions used in the first layer are included in the input as well. After some experimentation, we found that the best results are obtained when the output of the first layer system is combined as follows for inclusion in the input of the second layer. For each residue, two additional nodes represent the excitation of the two output nodes corresponding to the two classes, averaged over the networks included in the consensus. Another pair of nodes represents the maximum (again over the different networks) excitation for each class. Thus, the number of input nodes per residue in the second layer is either nine or eight, depending on the number of attributes used in the first layer.
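A minimal sketch of this combination step, under our own naming and array-layout assumptions: for each residue, the first-layer excitations of all consensus networks are reduced to the four extra input nodes, and the consensus class is obtained by majority voting.

```python
import numpy as np

def second_layer_inputs(excitations):
    """Reduce first-layer consensus outputs for one residue to the
    four extra input nodes described in the text.

    `excitations` has shape (n_networks, 2): the excitations of the
    two output nodes (non-TM vs. TM) for each network in the consensus.
    """
    mean_exc = excitations.mean(axis=0)  # 2 nodes: averaged excitation per class
    max_exc = excitations.max(axis=0)    # 2 nodes: maximum excitation per class
    return np.concatenate([mean_exc, max_exc])

def majority_vote(excitations):
    """Simple majority voting: each network votes for the class whose
    output node fires more strongly; ties default to non-TM here
    (the tie-breaking rule is our assumption)."""
    votes = excitations.argmax(axis=1)   # 0 = non-TM, 1 = TM
    return int(votes.sum() * 2 > len(votes))
```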

The choice of the optimal topology and other settings for the neural networks and the training procedures follows that for the first layer (see also the MINNOU package for further details).

While the second layer NNs lead to significant smoothing of the prediction and improve the overall accuracy in terms of both sensitivity and specificity, unphysically long or short helices are still occasionally predicted. We estimated the probability density distribution for the length of TM helices and used it as a guideline in the design of a filter, applied to the second layer prediction in order to avoid such unphysical predictions; similar filters have been used before by other groups (Rost et al., 1995). Basically, the final filter either splits predicted long TM helices or deletes too short ones, which are observed with very low frequency in the known sample of helical TM domains. The presence of relatively short (e.g., horizontally oriented) membrane-embedded helices, as well as relatively long (skewed) helices that occur, for instance, in some ion channels, influences the choice of the length thresholds applied here. Specifically, if only a single membrane segment is predicted in a protein, it is deleted if shorter than 14 residues; if more than one membrane segment is predicted, each is deleted if shorter than 8 residues. On the other hand, a continuous membrane segment of length greater than 44 is split into two segments in the middle (by introducing an artificial loop of length one), and a predicted segment longer than 66 residues (the latter happened seven times in the set of 2247 segments predicted for the TMH Benchmark test) is split into three segments. These rules are sketched in code at the end of this section.

Even though this final post-processing step does not significantly affect the per-residue accuracy, it does help to smooth the predictions and to improve the per-segment accuracy, Q_OK, as defined in (Chen et al., 2002; Kernytsky and Rost, 2003). It also helps to further reduce the observed level of confusion with signal peptides and globular proteins by filtering out unphysically short TM helices (see Table 6 and the Results section of the main body of the paper for further discussion). However, this final prediction stage is likely to be improved further, for example by explicitly considering the overall topology of membrane proteins and the characteristics of the loops connecting TM segments. This is the subject of future work.
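The length-filter rules above can be sketched as follows. This is a minimal sketch in Python; the segment representation, the order in which deletion and splitting are applied, and the exact placement of the artificial loops when splitting are our assumptions, not taken from the paper.

```python
def filter_segments(segments):
    """Apply the length filter to a list of predicted TM segments,
    given as (start, end) residue index pairs (end inclusive)."""
    # Delete segments that are too short: threshold 14 if the protein
    # has a single predicted segment, 8 otherwise.
    min_len = 14 if len(segments) == 1 else 8
    kept = [(s, e) for s, e in segments if e - s + 1 >= min_len]

    result = []
    for s, e in kept:
        length = e - s + 1
        if length > 66:
            # Split into three segments, inserting two artificial
            # loops of length one (even placement assumed).
            third = length // 3
            result += [(s, s + third - 1),
                       (s + third + 1, s + 2 * third - 1),
                       (s + 2 * third + 1, e)]
        elif length > 44:
            # Split in the middle, inserting an artificial loop of length one.
            mid = s + length // 2
            result += [(s, mid - 1), (mid + 1, e)]
        else:
            result.append((s, e))
    return result
```

For example, filter_segments([(10, 80)]) keeps the single 71-residue prediction (longer than 14) and splits it into three shorter helices separated by loops of length one.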

SP2. Tables and figures

Table 5. Per-residue classification accuracy (Q3) of secondary structure predictions obtained using the SABLE (Adamczak et al., 2005) and PSIPRED (Jones, 1999) servers. The results on non-redundant sets of 73 alpha-helical and 15 beta-barrel membrane proteins, respectively, are shown.

Test set                         SABLE    PSIPRED
Alpha-helical: all               78.2%    76.6%
Alpha-helical: TM segments only  83.3%    80.6%
Beta-barrel: all                 68.7%    75.7%
Beta-barrel: TM segments only    64.9%    77.6%

Table 6. Improvements due to the 1st layer, 2nd layer and filter (F) based predictors according to the TMH Benchmark evaluation. Per-segment (Q_OK) and per-residue (Q2) classification accuracies and confusion levels with signal peptides (csp) and globular proteins (cgp) are given in per cent. Five different prediction systems are compared. The first two involve the first layer predictions: (a) training with the standard training set containing membrane proteins only, and (b) training with the augmented training set that includes signal peptides and some false positives from the first iteration. The next two rows give the corresponding results for the second layer predictors, which use the results of the first layer classifiers as part of their input. The last row contains the results of the second layer prediction with filtering of short, unphysical segments, which improves the prediction in some respects.

               High resolution   Low resolution
               Q_OK    Q2        Q_OK    Q2       csp    cgp
1st layer (a)  66      88        32      83       78     27
1st layer (b)  61      88        40      85       54     12
2nd layer (a)  66      89        24      82       64     23
2nd layer (b)  66      88        39      84       27     8
2nd layer (F)  80      89        55      85       8      1

Figure 1. Examples of MINNOU transmembrane helix predictions for an ion channel, PDB code 1OTS (panel A), and a glutamate transporter protein, PDB code 1XFH (panel B). The amino acid sequence is included in the first row, with the actual and predicted membrane segments highlighted using bold type and yellow boxes, respectively. The secondary structures and relative solvent accessibilities, predicted using the SABLE server, are shown in the second and third rows, respectively. The alpha-helices are represented by red ribbons, the beta-strands by green arrows and the coils by blue lines. The level of predicted aqueous solvent exposure is represented by shaded boxes, with fully buried and fully exposed residues shown as black and white boxes, respectively. The secondary structures observed in the experimentally resolved structures are shown in the last row for comparison. See text for further details.


Figure 2. The distribution of lengths for the TM segments included in the MPtopo (Jayasinghe et al., 2001) database of membrane proteins. 3D_helix and 1D_helix refer to helical membrane proteins with and without resolved 3D structures, respectively (see text for details). Thus, in the case of the 1D_helix set, only low resolution information derived from various experimental studies is available, and consequently the delineation of TM segments is laden with additional uncertainty. Moreover, it has been suggested that prediction methods may have been used to derive the actual boundaries of TM segments for some of the 1D_helix proteins (Chen et al., 2002). As can be seen from the figure, the distribution of lengths for low resolution structures is indeed qualitatively different from that for the high resolution (3D_helix) set. In particular, a sharp peak is observed around the canonical length of an alpha-helical TM segment (20-22 residues) for low resolution structures, in stark contrast to the much wider distribution for high resolution structures. Because of these differences and the higher uncertainty in the annotation of low resolution structures, only the 3D_helix set was used for training MINNOU. The 3D_other set includes beta-barrel membrane proteins with known 3D structures and is characterized by much shorter membrane-spanning segments.

[Plot: number of TM segments (0-70) versus TM segment length (0-50 residues) for the 3D_helix, 1D_helix and 3D_other sets.]

Figure 3. Dependence of the prediction accuracy on the size of the sliding window, using 10-fold cross-validation on the non-redundant training set of 73 alpha-helical membrane protein chains (see text for details). The results of a simple NN, with one hidden layer consisting of 10 nodes, trained for 1000 cycles with the default learning parameter and the Rprop delta parameter set to 0.1, are compared in terms of the Matthews Correlation Coefficient (MCC). The fixed topology and meta-parameters of the networks allow one to compare the results directly and to assess the extent of overfitting due to increased window size. Each amino acid residue in the sliding window is represented here by five numbers, using the prediction-based structural profiles described in the text. Thus, increasing the size of the sliding window from 11 to 21 increases the number of weights to be optimized from 550 to 1050 (neglecting the weights on the edges between the hidden layer and the two nodes in the output layer, as well as the parameters of the activation functions of the nodes in the hidden and output layers). As can be seen from the figure, the cross-validation estimates of the accuracy increase quickly at first, reach a plateau, and then gradually decrease with the growing size of the sliding window, pointing to problems with overfitting for more complex models. It should be noted that by changing the topology of the network we were able to obtain somewhat better results (at the level of MCC = 0.74, as reported in Tables 1 and 2 for different sliding windows). We would also like to note that, while very short sliding windows seem capable of capturing relatively well the essential information about an amino acid residue and its environment (since not too many residues predicted to be fully buried and in alpha-helices are found in the loop domains of the TM proteins included in the training set), the overall quality of predictions based on such short windows is low, due to a high level of confusion with globular proteins and low per-segment accuracy.

[Plot: MCC (0.55-0.75) versus sliding window size (0-30 residues).]
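The weight counts quoted in the caption follow from the input-to-hidden connections alone: five attributes per residue, times the window size, times the ten hidden nodes. A quick check of this arithmetic in Python:

```python
# Input-to-hidden weight count quoted in the Figure 3 caption:
# 5 attributes per residue x window size x 10 hidden nodes
# (hidden-to-output weights, biases and activation-function
# parameters are neglected, as stated in the caption).
def n_weights(window_size, attrs_per_residue=5, hidden_nodes=10):
    return attrs_per_residue * window_size * hidden_nodes

assert n_weights(11) == 550   # window of 11 residues
assert n_weights(21) == 1050  # window of 21 residues
```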