BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I


BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 9. Protein Structure Prediction I

Structure Prediction Overview
- Overview of problem variants
- Secondary structure prediction
  - Automatic extraction from 3D structures
  - Prediction algorithms: Chou-Fasman, PHD
  - Consensus server
- Benchmarks: CASP

Basic Problem
KVYGRCELAAAMKRLGLDNYR
GYSLGNWVCAAKFESNFNTHA
TNRNTDGSTDYGILQINSRWW
CNDGRTPGSKNLCNIPCSALL
SSDITASVNCAKKIASGGNGM
NAWVAWRNRCKGTDVHAWIRG
CRL
[Figure: the sequence above together with its secondary structure and its tertiary structure]

Protein Structure Prediction
- Basic problem: given a sequence, predict its structure
- Choice of method depends on
  - Availability of homologous structures
  - Availability of additional experimental data
  - Quality/accuracy of the desired model
- Predict backbone positions only
  - Side chains are modeled independently
  - Techniques for this will be discussed later

Methods
[Figure: flowchart of prediction methods with the boxes Sequence, Sequence Search, Sequence DB, Homologs, Mult. Alignment + Profiles, Sec. Struct. Prediction, Secondary Structure, Ab initio Prediction, Fold Recognition, Threading, Model, Modeling/Refinement, Refined Model]
After: Zimmer, Lengauer: Bioinformatics - From Genomes to Drugs, Wiley-VCH, 2001

Ab Initio Prediction
- Prediction based on physical models only (ab initio = from first principles)
- Does not require information from homologous structures
- Prediction of new folds is possible
- Applicable to small proteins only (<100 aa)
[Figure: Sequence and Potential enter the Ab initio Prediction, which yields a Model]

Threading
- Threading
  - Model a target sequence onto the structures of several homologs (templates)
  - Choose the template structure that best matches the target sequence
  - Build a full model of the sequence based on the template
  - Restricted to the modeling of known fold classes
- Fold recognition
  - Simplified version of the threading problem
  - Identify the fold class of the target sequence only

Secondary Structure Prediction
- Given: sequence

  KVYGRCELAAAMKRLGLDNYRGYSLGNWVC
  AAKFESNFNTHATNRNTDGSTDYGILQINS
  RWWCNDGRTPGSKNLCNIPCSALLSSDITA
  SVNCAKKIASGGNGMNAWVAWRNRCKGTDV
  HAWIRGCRL

- Find: secondary structure assignment for three classes, E (extended, strand), H (helix), C/- (coil), for every aa

  KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTD
  -----HHHHHHHHH-------------EEEEE----------------
  GSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAK
  ----EEEEEE--------------------------------HHHHHH
  KIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL
  HHH-------EEE--------------------

Test Data
- To assess the quality of predictions, we need a gold standard
- This is usually obtained by extracting the secondary structure from high-quality crystal structures in the PDB
- Problem: how to extract the secondary structure from a 3D structure?
- DSSP and STRIDE are two well-known algorithms for automatic secondary structure assignment from a 3D structure
- They consider backbone torsion angles, H-bond patterns, and other parameters that are characteristic of certain secondary structures
- The algorithm assigns one of three (or eight) secondary structure classes to each aa of the structure

DSSP
- The core of DSSP is a function for the detection of H-bonds formed by the protein backbone
- Whether an H-bond exists is decided from the electrostatic energy of each acceptor/donor pair
- Assumption: C=O and H-N are polarized and bear partial charges q+ and q-, with q- = -0.20 e0 and q+ = +0.42 e0
[Figure: C=O and H-N groups with partial charges q+ and q- and the distances r_OH and r_ON]
Kabsch, Sander, Biopolymers (1983), 22, 2577

DSSP
- Hydrogen positions for the backbone NH are constructed from standard bond lengths/angles (hydrogens are not contained in XRD data!)
- DSSP assumes there is an H-bond between two amino acids (i, j) if E_ij is below the threshold t = -0.5 kcal/mol (about -2.1 kJ/mol)
- If H-bonds are present for (i, i+3), (i, i+4) or (i, i+5), this is interpreted as a 3-, 4-, or 5-turn
- Multiple adjacent turns of the same type correspond to 3_10-, α- and π-helices
- A β-bridge is assumed if there exist H-bonds for
  - (i-1, j) and (j, i+1) [parallel]
  - (i, j) and (j, i) [anti-parallel]
- Multiple adjacent β-bridges of the same type indicate the presence of β-sheets
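A minimal sketch of the H-bond test described above, assuming the energy expression as published by Kabsch and Sander, E = q1 q2 (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN) f, with charge magnitudes q1 = 0.42 e and q2 = 0.20 e, f = 332 Å kcal/(mol e^2), and a cutoff of -0.5 kcal/mol. The function names and the example coordinates are illustrative only.

```python
import math

Q1 = 0.42         # charge magnitude on the C=O group (units of e)
Q2 = 0.20         # charge magnitude on the N-H group (units of e)
F = 332.0         # conversion factor, Angstrom * kcal / (mol * e^2)
THRESHOLD = -0.5  # assumed cutoff in kcal/mol


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O and an N-H group.

    All arguments are (x, y, z) coordinates in Angstrom.
    """
    return Q1 * Q2 * F * (
        1.0 / math.dist(o, n)
        + 1.0 / math.dist(c, h)
        - 1.0 / math.dist(o, h)
        - 1.0 / math.dist(c, n)
    )


def is_hbond(c, o, n, h, threshold=THRESHOLD):
    """True if the pair counts as an H-bond under the assumed cutoff."""
    return hbond_energy(c, o, n, h) < threshold


# Idealized collinear geometry C=O ... H-N (made-up coordinates)
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

With this idealized geometry the energy comes out close to -3 kcal/mol, well below the cutoff; moving the N-H group 10 Å away brings the energy close to zero and the test fails.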

STRIDE
- STRIDE is an improved version of DSSP
- Improved energy function for H-bonds
  - Includes dependence on the H-bond angle
  - Different thresholds for helices/sheets
- Also considers backbone torsion angles
- Often recognizes amino acids at the end of a secondary structure element that DSSP would miss (resulting in slightly longer helices/strands)
- STRIDE yields slightly better results than DSSP (95% correct for helices, 93% for strands; relative to manually annotated X-ray structures)
Frishman, Argos, Proteins (1995), 23, 566

STRIDE
- The empirical potential for the H-bond energy contains a distance-dependent contribution ($E_r$) and directional contributions ($E_t$, $E_p$)
- The distance dependence is modeled by an 8-6 potential

  $$E_r = \frac{C}{r^8} - \frac{D}{r^6}$$

  where r is the distance between the donor and acceptor atoms (N, O) and C, D are constants derived from average H-bond donor-acceptor distances
- The angle-dependent terms describe the deviation from the ideal bond geometry; they are rather complex and thus left out here (for details see Frishman & Argos, 1995)
Frishman, Argos, Proteins (1995), 23, 566

DSSPcont
- Secondary structure assignment is not unambiguous
  - Structures are flexible
  - Parts of the structure might fluctuate between different secondary structures
- Example: H-bonds at the end of a helix are often very close to the DSSP threshold
- DSSPcont: instead of a fixed assignment, estimate probabilities for each secondary structure
Andersen et al., Structure (2002), 10, 175

DSSPcont
- Apply DSSP, but compute the secondary structure assignment for various thresholds $t \in T = \{-1.0, -0.9, \ldots, -0.1\}$ kcal/mol
- Every aa i of the sequence is assigned a secondary structure class $c = \mathrm{DSSP}(i, t)$ with $c \in C = \{G, H, I, T, E, B, S, L\}$ for each threshold t
- We now define a binary function $\mathrm{DSSP}_{it}(c)$ as

  $$\mathrm{DSSP}_{it}(c) = \begin{cases} 1 & \text{if } \mathrm{DSSP}(i, t) = c \\ 0 & \text{otherwise} \end{cases}$$

- For each sequence position i, $\mathrm{DSSP}_{it}(c)$ defines an 8x10 matrix with the DSSP assignments for all thresholds
Andersen et al., Structure (2002), 10, 175

DSSPcont
- From this matrix, DSSPcont determines the probabilities $\mathrm{DSSPcont}_i(c)$ for each position i and class c by scaling with empirically determined weights $w_{it}$
- This assigns a vector of probabilities for each of the eight secondary structure classes to each position
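The weighting step can be sketched as a normalized weighted vote over the threshold-specific DSSP assignments. This is only an illustration under stated assumptions: it uses one weight per threshold (the weights $w_{it}$ above may also depend on the position), the weights in the usage example are uniform rather than the empirically determined ones, and the function name `dsspcont` is hypothetical.

```python
CLASSES = "GHITEBSL"  # the eight DSSP classes


def dsspcont(assignments, weights):
    """Turn DSSP strings for several thresholds into class probabilities.

    assignments: list of DSSP strings, one per threshold, all of equal length
    weights:     list of non-negative weights, one per threshold (assumed)
    Returns one dict {class: probability} per sequence position.
    """
    total = sum(weights)
    length = len(assignments[0])
    result = []
    for i in range(length):
        probs = {c: 0.0 for c in CLASSES}
        for assignment, w in zip(assignments, weights):
            probs[assignment[i]] += w / total
        result.append(probs)
    return result


# Ten thresholds: seven runs call residue 2 a helix (H), three call it a turn (T)
runs = ["HHL"] * 7 + ["HTL"] * 3
for position in dsspcont(runs, [1.0] * 10):
    print({c: round(p, 2) for c, p in position.items() if p > 0})
```

Positions where all thresholds agree get probability 1 for a single class, while residues near the end of a helix typically spread their probability over several classes.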

DSSPcont
- Secondary structure variability (in particular at the ends of the helices) in the 23 NMR models of 1CY3 is correctly captured by DSSPcont
- This allows the identification of areas of unstable secondary structure
Andersen et al., Structure (2002), 10, 175

Quality Measures
- Three-state classification (C/H/E: Coil/Helix/Extended)
- Q3 score: percentage of correctly assigned amino acids according to the three-state classification
- The ends of secondary structure elements in particular often cannot be classified unambiguously (cf. thresholding in DSSP!)
- Predictions with 80+% accuracy are thus excellent
[Figure: predicted vs. observed secondary structure]
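The Q3 score is a per-residue accuracy and can be computed directly from two equally long state strings. A minimal sketch; the function name `q3` and the example strings are illustrative.

```python
def q3(predicted, observed):
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# One residue at a helix end is predicted as helix but observed as coil
print(q3("HHHHCCEE", "HHHCCCEE"))
```

The same function yields the Q8 score when both strings use the eight-state alphabet.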

Quality Measures
- Occasionally eight-state classifications are used (H/E/G/I/T/B/S/L)
  - 3_10 helix (G)
  - α-helix (H)
  - π-helix (I)
  - helix turn (T)
  - strand (E)
  - β-bridge (B)
  - bend (S)
  - other/loop (L)
- Q8 score: fraction of correctly assigned amino acids
- The eight classes can be mapped back to three:
  - HELIX = 3_10 helix + α-helix + π-helix
  - EXTENDED = strand + β-bridge
  - LOOP = loop + bend + helix turn
- The Q8 score is generally smaller than the Q3 score
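The eight-to-three mapping listed above fits into a small lookup table. A minimal sketch, using the letter C for the LOOP class; the names `EIGHT_TO_THREE` and `reduce_to_three` are illustrative.

```python
# Mapping as listed above: HELIX = G/H/I, EXTENDED = E/B, LOOP = L/S/T
EIGHT_TO_THREE = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C",
}


def reduce_to_three(eight_state):
    """Map an eight-state assignment string to the three states H, E, C."""
    return "".join(EIGHT_TO_THREE[c] for c in eight_state)


print(reduce_to_three("LLGGGHHHHTTEEEEBSL"))
```

Other groupings are in use as well (for example counting G as coil), so reported Q3 values depend on the mapping chosen.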

Segment OVerlap (SOV)
- Measure for the overlap between predicted and observed secondary structure, based on the comparison of pairs of segments
- Compare observed ($s_b$) and predicted ($s_v$) segments of the same type (type: H, C or E)
- 100% for an entirely correct assignment
- $\mathrm{minov}(s_b, s_v)$: length of the intersection of $s_b$ and $s_v$
- $\mathrm{maxov}(s_b, s_v)$: length of the union of $s_b$ and $s_v$
[Figure: an observed segment $s_b$ and a predicted segment $s_v$ with minov and maxov marked]

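Segment OVerlap (SOV)

$$\mathrm{SOV} = \frac{100}{N} \sum_{t \in \{H, C, E\}} \; \sum_{(s_b, s_v) \in S(t)} \frac{\mathrm{minov}(s_b, s_v) + \delta(s_b, s_v)}{\mathrm{maxov}(s_b, s_v)} \, |s_b|$$

$$\delta(s_b, s_v) = \min\left\{ \mathrm{maxov}(s_b, s_v) - \mathrm{minov}(s_b, s_v);\; \mathrm{minov}(s_b, s_v);\; \frac{|s_b|}{2};\; \frac{|s_v|}{2} \right\}$$

- $|s|$: length of segment s
- $t \in \{H, C, E\}$: secondary structure type
- $N = \sum |s_b|$: total length of all segments
- $S(t)$: set of all pairs $(s_b, s_v)$ of overlapping segments of type $t \in \{H, C, E\}$ in the observed and predicted structure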
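A sketch of the SOV computation under two assumptions that follow the commonly used SOV'99 definition rather than anything stated on the slides: the segment half-lengths in δ are rounded down to integers, and N counts the observed segment once per overlapping pair plus once for every observed segment without any overlapping predicted segment. The function names are illustrative.

```python
def segments(ss):
    """Split a state string into (type, start, end) runs, end exclusive."""
    runs, start = [], 0
    for i in range(1, len(ss) + 1):
        if i == len(ss) or ss[i] != ss[start]:
            runs.append((ss[start], start, i))
            start = i
    return runs


def sov(observed, predicted, types="HEC"):
    """Segment overlap score in percent (assumed SOV'99-style normalization)."""
    obs, pred = segments(observed), segments(predicted)
    total, norm = 0.0, 0
    for t in types:
        for tb, b0, b1 in obs:
            if tb != t:
                continue
            len_b = b1 - b0
            partners = [(v0, v1) for tv, v0, v1 in pred
                        if tv == t and v0 < b1 and b0 < v1]
            if not partners:
                norm += len_b  # unmatched observed segment still counts in N
                continue
            for v0, v1 in partners:
                len_v = v1 - v0
                minov = min(b1, v1) - max(b0, v0)  # intersection length
                maxov = max(b1, v1) - min(b0, v0)  # union length
                delta = min(maxov - minov, minov, len_b // 2, len_v // 2)
                total += (minov + delta) / maxov * len_b
                norm += len_b
    return 100.0 * total / norm if norm else 0.0


# Half of a ten-residue helix is predicted: Q3 would be 50%, SOV is higher
print(round(sov("HHHHHHHHHH", "HHHHHCCCCC"), 1))
```

The δ term is what rewards a prediction that hits the right segment but misplaces its ends, which is why SOV and Q3 can differ noticeably for the same prediction.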

Secondary Structure Prediction
Several generations of algorithms:
- 1st generation: consider properties of individual aa only (Q3 about 50-60%)
- 2nd generation: include the local environment (Q3 about 65%)
- 3rd generation: include information from homologs (Q3 > 70%)
- 4th generation: consensus methods combining results from several other (sub-prediction) methods (Q3 about 75-80%)

Chou-Fasman Algorithm
- Idea: amino acids differ in their affinity towards specific secondary structures
- Analysis of structural databases: how often is each aa found in each secondary structure type?
- Let $n_j$ be the number of occurrences of aa j in all proteins of the database
- The probability $p_j$ of aa j occurring in a protein is then $p_j = n_j / \sum_{j'} n_{j'}$
- Similarly, define the probability of finding aa j in secondary structure type k (with $k \in \{C, H, E\}$) as $p_{j,k} = n_{j,k} / \sum_{j'} n_{j',k}$
Chou, Fasman, Biochemistry (1974), 13, 211

Chou-Fasman Algorithm
- Similarly, define the relative probability $f_{j,k}$ of finding aa j in secondary structure type k: $f_{j,k} = n_{j,k} / n_j$
- The average probability for any of the 20 aa to be found in secondary structure k can thus be written as $\langle f_k \rangle = \sum_j f_{j,k} / 20 = \sum_j n_{j,k} / \sum_j n_j$
- The relative probability that aa j occurs in secondary structure k is thus $P_{j,k} = f_{j,k} / \langle f_k \rangle$
- These relative probabilities define the preference of the individual amino acids for a certain secondary structure type
Chou, Fasman, Biochemistry (1974), 13, 211
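The propensities can be derived from any set of sequences with per-residue annotations. The sketch below assumes that $\langle f_k \rangle$ is computed as the ratio of totals, $\sum_j n_{j,k} / \sum_j n_j$; the mean of the 20 individual $f_{j,k}$ values gives the same number only when all amino acids are equally frequent. The function name and the toy data are illustrative.

```python
from collections import Counter


def propensities(sequences, structures, states="HEC"):
    """Return {(aa, state): P} with P = f_jk / <f_k>, from annotated sequences."""
    n_j = Counter()   # occurrences of each amino acid
    n_jk = Counter()  # occurrences of each amino acid in each state
    for seq, ss in zip(sequences, structures):
        for aa, k in zip(seq, ss):
            n_j[aa] += 1
            n_jk[aa, k] += 1
    total = sum(n_j.values())
    result = {}
    for k in states:
        # <f_k> as ratio of totals (assumption, see lead-in)
        avg_f = sum(n_jk[aa, k] for aa in n_j) / total
        for aa in n_j:
            f_jk = n_jk[aa, k] / n_j[aa]
            result[aa, k] = f_jk / avg_f if avg_f else 0.0
    return result


# Toy data: Ala is always helical, Gly only half of the time
P = propensities(["AAGG"], ["HHHC"])
print(round(P["A", "H"], 2), round(P["G", "H"], 2), P["G", "C"])
```

Values above 1 mark residues that occur in a state more often than average, values below 1 mark residues that avoid it.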

Chou-Fasman Algorithm
- Divide the 20 aa into several classes according to their Pα:
  - Strong helix builders Hα (Glu, Ala, Leu)
  - Helix builders hα (His, Met, Gln, Trp, Val, Phe)
  - Weak helix builders Iα (Lys, Ile)
  - Indifferent iα (Asp, Thr, Ser, Arg, Cys)
  - Weak helix breakers bα (Asn, Tyr)
  - Strong helix breakers Bα (Pro, Gly)
- Similarly for β-strands: Hβ, hβ, iβ, bβ, Bβ
Chou, Fasman, Biochemistry (1974), 13, 211

Chou-Fasman Parameters

AA    Pα     Class  |  AA    Pβ     Class
Glu   1.53   Hα     |  Met   1.67   Hβ
Ala   1.45   Hα     |  Val   1.65   Hβ
Leu   1.34   Hα     |  Ile   1.60   Hβ
His   1.24   hα     |  Cys   1.30   hβ
Met   1.20   hα     |  Tyr   1.29   hβ
Gln   1.17   hα     |  Phe   1.28   hβ
Trp   1.14   hα     |  Gln   1.23   hβ
Val   1.14   hα     |  Leu   1.22   hβ
Phe   1.12   hα     |  Thr   1.20   hβ
Lys   1.07   Iα     |  Trp   1.19   hβ
Ile   1.00   Iα     |  Ala   0.93   Iβ
Asp   0.98   iα     |  Arg   0.90   iβ
Thr   0.82   iα     |  Gly   0.81   iβ
Ser   0.79   iα     |  Asp   0.80   iβ
Arg   0.79   iα     |  Lys   0.74   bβ
Cys   0.77   iα     |  Ser   0.72   bβ
Asn   0.73   bα     |  His   0.71   bβ
Tyr   0.61   bα     |  Asn   0.65   bβ
Pro   0.59   Bα     |  Pro   0.62   bβ
Gly   0.53   Bα     |  Glu   0.26   Bβ

Chou, Fasman, Biochemistry (1974), 13, 222

Chou-Fasman Algorithm II
Example:

Sequence:  ..  T    S    P    T    A    E    L    M    R    S    T    G   ..
Class:         iα   iα   Bα   iα   Hα   Hα   hα   Hα   iα   iα   iα   Bα
Weight:        0.5  0.5  -1   0.5  1    1    1    1    0.5  0.5  0.5  -1
Pα:            0.8  0.8  0.6  0.8  1.4  1.5  1.2  1.5  1.0  0.8  0.8  0.6

- Helix start: the weights of the window T A E L M R sum to 5
- Expand to the left with a window of 4 aa (based on Pα values!)
  - P T A E: 4.3 / 4 > 1.0, continue
  - S P T A: 3.6 / 4 < 1.0, stop
- Expand to the right with a window of 4 aa (based on Pα values!)
  - L M R S: 4.5 / 4 > 1.0, continue
  - M R S T: 4.1 / 4 > 1.0, continue
  - R S T G: 3.2 / 4 < 1.0, stop
- A similar procedure is then applied for strands

Chou-Fasman Algorithm III
Algorithm (simplified!)
- Assign α/β classes to each aa of the sequence $S = s_1 s_2 \ldots s_k$
- A: HELICES
  - Assign a weight $w_i$ to every aa i with w(Hα) = w(hα) = 1, w(iα) = 0.5, w(bα) = w(Bα) = -1
  - Find helix cores
    - Find the first window of length 6 aa with $\sum w_i \ge 4$
  - Expand cores to the left and to the right
    - Windows of length 4
    - Shift to the left and right until $\sum P_\alpha(s_i) < 4$
    - Compatible aa of the first window that no longer matches are considered part of the helix (special rule for compatibility)
Chou, Fasman, Biochemistry (1974), 13, 222
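The simplified helix rules can be sketched as follows. The sketch takes per-residue weights and Pα values as plain lists, so it does not depend on a particular parameter table; the numbers in the usage example are the rounded values from the worked example (T S P T A E L M R S T G). The function name `find_helix` is hypothetical, and the special compatibility rules of the full algorithm are not modeled.

```python
def find_helix(weights, p_alpha, core_len=6, core_min=4.0, ext_len=4, ext_min=4.0):
    """Return (start, end) of the first predicted helix, end exclusive, or None."""
    n = len(weights)
    # 1. Helix core: first window of core_len residues with weight sum >= core_min
    for start in range(n - core_len + 1):
        if sum(weights[start:start + core_len]) >= core_min:
            break
    else:
        return None
    end = start + core_len
    # 2. Extend to the left while the 4-residue window containing the
    #    candidate residue keeps a P_alpha sum of at least ext_min
    while start > 0 and sum(p_alpha[start - 1:start - 1 + ext_len]) >= ext_min:
        start -= 1
    # 3. Extend to the right in the same way
    while end < n and sum(p_alpha[end - ext_len + 1:end + 1]) >= ext_min:
        end += 1
    return start, end


seq = "TSPTAELMRSTG"
w = [0.5, 0.5, -1, 0.5, 1, 1, 1, 1, 0.5, 0.5, 0.5, -1]
p = [0.8, 0.8, 0.6, 0.8, 1.4, 1.5, 1.2, 1.5, 1.0, 0.8, 0.8, 0.6]
start, end = find_helix(w, p)
print(seq[start:end])
```

For the example the core is T A E L M R, one extension step succeeds on the left and two on the right, so the sketch reports the stretch from P to T.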

Chou-Fasman Algorithm III
Algorithm (simplified!)
- B: STRANDS
  - Assign weights $w_i$ with w(Hβ) = w(hβ) = 1, w(iβ) = 0.5, w(bβ) = w(Bβ) = -1
  - Find strand cores
    - Windows of length five with
      - Three or more Hβ or hβ
      - At most one Bβ or bβ
  - Expand cores to the left and right
    - Windows of four aa
    - Shift left/right until $\sum P_\beta(s_i) < 4$
Chou, Fasman, Biochemistry (1974), 13, 222

Chou-Fasman Algorithm III
Algorithm (simplified!)
- C: CONFLICT RESOLUTION
  - For segments marked as both α and β: calculate the averages $P^{avg}_\alpha$ and $P^{avg}_\beta$
    - Helix, if $P^{avg}_\alpha > P^{avg}_\beta$
    - Strand, if $P^{avg}_\alpha < P^{avg}_\beta$
- The complete algorithm contains further rules for assignments at the ends of segments and for conflict resolution
Chou, Fasman, Biochemistry (1974), 13, 222

Chou-Fasman Algorithm
- Online prediction: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1
- Prediction accuracy is rather low (50-60%)
- There is a whole range of improved methods:
  - Including the prediction of turns
  - Improved statistics (Chou, Fasman: 15 proteins!)
- Key problem: neighboring residues have a strong influence and need to be considered in a way that goes beyond simple averaging

Non-Locality
- The same sequence can produce different secondary structures: Val-Asn-Thr-Phe-Val in 1ECN (80-84) and 9RSA (43-47)
[Figure: the pentapeptide in the structures 1ECN and 9RSA]

Non-Locality
- Strands show stronger non-locality than helices: interactions between very distant sequence regions are necessary for stabilization
- Helices: interactions only between adjacent turns of the helix (at most 5 aa apart!)

2nd Generation Methods
- Include neighboring residues
  - Drastically improves prediction for helices
  - Strands are still difficult
- Wide range of methods employing all sorts of techniques from statistical learning
  - Artificial neural networks
  - LDFs (Linear Discriminant Functions)
  - Nearest-neighbor classifiers
  - Support Vector Machines
  - Hidden Markov Models

GOR Method
- Garnier-Osguthorpe-Robson method
- Several variants (GOR I to GOR IV)
- Here: GOR IV as an example of a 2nd generation method
- Includes neighboring residues in a wider window
- Window length in GOR IV: 17 aa
- Common lengths of secondary structure elements:
  - Helices ca. 5-40 aa
  - Strands ca. 4-10 aa

GOR IV
- Instead of $P_{ij}$ there are now three matrices (PSSMs, position-specific scoring matrices)
- One for each of the classes H, C, E
- A matrix entry corresponds to the probability of finding a certain residue in this environment in a given secondary structure type
[Figure: scoring matrix with rows Val, Tyr, ..., Cys, Ala placed over a window of the sequence YGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFES]

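GOR IV
- The matrix entries $S^\alpha_{ij}$ are determined from the training data (the formula is not preserved in this transcript)
- The score for position i is then obtained by summation over the whole window
[Figure: scoring matrix with rows Tyr, ..., Gly, Met placed over a window of the sequence YGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFES]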
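The window summation can be sketched as follows, assuming one scoring matrix per class that is indexed by window offset and residue. The matrix values in the usage example are made up purely to exercise the code, and the sketch covers only the single-residue summation described above, not the full published GOR IV scheme. The function name `gor_predict` is hypothetical.

```python
def gor_predict(sequence, matrices, half_window=8):
    """Predict one state per residue by summing window scores per class.

    matrices: {state: list of 2 * half_window + 1 dicts {residue: score}},
              one dict per window offset (assumed layout).
    Positions beyond the sequence ends contribute nothing.
    """
    prediction = []
    for i in range(len(sequence)):
        scores = {}
        for state, matrix in matrices.items():
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                pos = i + offset
                if 0 <= pos < len(sequence):
                    total += matrix[offset + half_window].get(sequence[pos], 0.0)
            scores[state] = total
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)


# Toy matrices for a 3-residue window: Ala favors helix, Gly favors coil
toy = {
    "H": [{"A": 1.0}] * 3,
    "C": [{"G": 1.0}] * 3,
    "E": [{}] * 3,
}
print(gor_predict("AAAGGG", toy, half_window=1))
```

With `half_window=8` the window covers 17 residues, matching the GOR IV window length given above.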

GOR IV
- Requires a large data set to determine all matrix elements with sufficient accuracy
- Still leads to ambiguities, in particular at the ends of secondary structure elements
- Prediction quality: Q3 about 64%
- Available online at http://abs.cit.nih.gov/
- There exist further, slightly improved versions (e.g. GOR V)

Third-Generation Methods
- Only about 65% of the information required is local
  - 1st/2nd generation methods cannot get much better
- Observation
  - About 67% of the residues of a sequence can be exchanged without breaking the secondary structures
  - Evolution has tried many of these neutral mutations
  - Evolutionarily related (homologous) sequences contain this information
  - If there are helix breakers in homologous sequences at the same position, it is unlikely that there is a helix
- This type of information is easily integrated through sequence profiles

PHD
- PHD uses
  - Artificial neural networks (ANNs) for classification
  - Profiles of homologous sequences
- Three-layered ANN
  - 1st + 2nd layer: mapping of the sequence/profile onto secondary structure classes
  - 3rd layer: majority vote on the results of the previous layers
Rost, Sander, JMB (1993), 232, 584

Recap: ANNs
- Graph defines the topology
  - Arranged in layers
  - Weighted edges
- Weighted summation of input signals
- (Nonlinear) activation function f
  - Popular choice: f = logistic function
[Figure: a neuron with inputs I1, I2, I3, weights w1, w2, w3, summation and activation function f]
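A single unit of such a network, written out as a minimal sketch: a weighted sum of the inputs followed by the logistic function. The bias term and the function names are additions for illustration.

```python
import math


def logistic(x):
    """Logistic activation function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias=0.0):
    """Weighted summation of the input signals followed by the activation."""
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Three inputs I1..I3 with weights w1..w3 (made-up numbers)
print(round(neuron([1.0, 0.5, -1.0], [0.8, -0.2, 0.4]), 3))
```

A layer is a list of such units sharing the same inputs, and a network feeds the outputs of one layer into the next.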

PHD
Topology of the ANN
[Figure: a window of the query sequence (.. K E L N D L E K K Y N A H I G ..) and the corresponding columns of the aligned sequences are fed into the 1st layer (sequence-to-structure), followed by the 2nd layer (structure-to-structure) and the 3rd layer (jury decision); in the example the outputs are 2.46, 0.37 and 1.26, and the result is Helix]
After: Rost, Sander, J. Mol. Biol. (1993), 232, 584

PHD
- A post-processing step then removes secondary structure elements with a length below three aa
- The ANN is trained on DSSP-annotated X-ray structures
- Results:
  - Use of profiles instead of single sequences improves Q3 by about 6%; use of majority votes adds another 2%
  - The improved version PHD3 raises Q3 to about 75%

PSIPRED I
- Three-step algorithm
  - Construction of a profile
  - Prediction with a two-layer ANN
  - Filtering of predictions
- Profile generation
  - PSI-BLAST run (three iterations) of the sequence against a large, non-redundant protein sequence database
  - The PSI-BLAST profile (scoring matrix) serves as input to the first layer of the ANN
Jones, J. Mol. Biol. (1999), 292, 195

PSIPRED II
- A window of 15 rows of the profile is used for the first layer
- The 15 x 3 outputs of the first layer are connected to the second layer, which recognizes neighboring residues of similar secondary structure (segment filtering)
- The 2nd layer produces the final classification
- Network sizes
  - 1st layer: 15x21 inputs (profile columns A C D E F G H I K L M N P Q R S T V W Y -), 75 hidden nodes, 3 outputs
  - 2nd layer: 60 inputs, 60 hidden nodes, 3 outputs
Jones, JMB (1999), 292, 195
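How the 15x21 input vector might be assembled from a profile can be sketched as below. The sketch assumes that the 21st value per window position (the "-" column) flags positions that fall beyond the ends of the chain; the function name `window_features` and the toy profile are illustrative, and real profile values would come from PSI-BLAST.

```python
def window_features(profile, center, window=15):
    """Flatten a window of profile rows into window * 21 input values.

    profile: list of rows, each with 20 values (one per amino acid type)
    center:  index of the residue whose state is to be predicted
    The 21st value of each position is 1.0 when the window position lies
    outside the sequence (assumed meaning of the extra column), else 0.0.
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(list(profile[pos]) + [0.0])
        else:
            features.extend([0.0] * 20 + [1.0])
    return features


toy_profile = [[0.1] * 20 for _ in range(30)]  # made-up profile, 30 residues
x = window_features(toy_profile, center=0)
print(len(x), x[20], x[7 * 21 + 20])
```

The resulting vector has 315 entries, which matches the 15x21 inputs of the first layer.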

PSIPRED III
- Training of the ANN through back propagation
- The 2nd layer removes very short secondary structure elements
- Results:
  - PSIPRED is one of the best prediction algorithms currently available
  - Online server: http://www.psipred.net
  - Q3 ~ 77%
  - Improved versions: Q3 ~ 81%
Jones, J. Mol. Biol. (1999), 292, 195

sspro
- Uses a bidirectional recurrent neural network (BRNN)
- Window size of 41 aa
- Evolutionary information from a multiple alignment
- Q3 ~ 76%
Baldi, Brunak, Frasconi, Soda, Pollastri, Bioinformatics (1999), 15, 937

Consensus Methods
- JPRED
  - Meta server: uses six independent methods in parallel
    - NNSSP (a variant of SSP)
    - PHD
    - MULPRED (multiple predictions including GOR, Chou & Fasman)
    - ZPRED
    - PREDATOR
    - DSC
  - Majority vote for each amino acid
  - If there is no clear winner: use the result of PHD!
  - Accuracy: 73% (1% better than PHD)
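The voting scheme can be sketched in a few lines. The sketch treats a tie for the most frequent state as "no clear winner", which is an assumption, and the predictions in the usage example are made up. The function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(predictions, fallback="PHD"):
    """Per-residue majority vote over several predictions.

    predictions: {method name: state string}, all strings of equal length
    fallback:    method whose call is used when the vote is tied (assumed rule)
    """
    length = len(predictions[fallback])
    result = []
    for i in range(length):
        votes = Counter(p[i] for p in predictions.values())
        ranked = votes.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(predictions[fallback][i])  # tie: trust the fallback
        else:
            result.append(ranked[0][0])
    return "".join(result)


made_up = {"PHD": "HHC", "DSC": "HEC", "PREDATOR": "HEE", "ZPRED": "HCE"}
print(consensus(made_up))
```

In the example the third position is a 2:2 tie between C and E, so the PHD call (C) is used.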

CASP5 Results
- CASP (Critical Assessment of Structure Prediction): a blind prediction competition
- Meta servers come out on top
- The top 10 achieve an SOV of about 80% (CASP4, 2000: 76%)
- Successful meta servers are based on sspro, PSIPRED and/or SAM-T02 (HMM approach)
- Helix predictions are still about 10% better than those for strands
Aloy et al., Proteins: Structure, Function, Genetics (2003), 53, 436

CASP5 Secondary Structure
[Figure: CASP5 secondary structure prediction results]
Aloy et al., Proteins: Structure, Function, Genetics (2003), 53, 436

Summary
- Secondary structure prediction is a first step in tertiary structure prediction
- Successful methods consider large sequence stretches and evolutionary information alike
- Meta-servers yield slightly superior results
- Prediction accuracies (Q3) of 75-80% are possible

References
- Burkhard Rost: Prediction in 1D. In: P. E. Bourne, H. Weissig (eds.): Structural Bioinformatics, Wiley, 2003
- Ralf Zimmer, Thomas Lengauer: Structure Prediction. Chapter 5 in: T. Lengauer (ed.): Bioinformatics: From Genomes to Drugs, Wiley, 2002