BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 9. Protein Structure Prediction I
Structure Prediction Overview
- Overview of problem variants
- Secondary structure prediction
  - Automatic extraction from 3D structures
  - Prediction algorithms: Chou-Fasman, PHD
  - Consensus server
- Benchmarks: CASP
Basic Problem
Sequence: KVYGRCELAAAMKRLGLDNYR GYSLGNWVCAAKFESNFNTHA TNRNTDGSTDYGILQINSRWW CNDGRTPGSKNLCNIPCSALL SSDITASVNCAKKIASGGNGM NAWVAWRNRCKGTDVHAWIRG CRL
[Figure: the sequence shown next to its secondary structure and its tertiary structure]
Protein Structure Prediction
- Basic problem: given a sequence, predict its structure
- The choice of method depends on
  - Availability of homologous structures
  - Availability of additional experimental data
  - Quality/accuracy of the desired model
- Predict backbone positions only
  - Side chains are modeled independently
  - Techniques for this will be discussed later
Methods
[Figure: workflow diagram with the elements sequence, sequence search, sequence DB, sequence homologs, multiple alignment + profiles, secondary structure prediction, secondary structure, ab initio prediction, fold recognition, threading, model, modeling/refinement, refined model]
After: Zimmer, Lengauer: Bioinformatics: From Genomes to Drugs, Wiley-VCH, 2001
Ab Initio Prediction
- Prediction based on physical models only (ab initio = from first principles)
- Does not require information from homologous structures
- Prediction of new folds is possible
- Applicable to small proteins only (< 100 aa)
[Figure: sequence and potential enter the ab initio prediction, which yields a model]
Threading
- Threading
  - Model a target sequence onto the structures of several homologs (templates)
  - Choose the template structure that best matches the target sequence
  - Build a full model of the sequence based on the template
  - Restricted to the modeling of known fold classes
- Fold recognition
  - Simplified version of the threading problem
  - Identify the fold class of the target sequence only
Secondary Structure Prediction
- Given: sequence
  KVYGRCELAAAMKRLGLDNYRGYSLGNWVC AAKFESNFNTHATNRNTDGSTDYGILQINS RWWCNDGRTPGSKNLCNIPCSALLSSDITA SVNCAKKIASGGNGMNAWVAWRNRCKGTDV HAWIRGCRL
- Find: secondary structure assignment for three classes, E (extended, strand), H (helix), C/- (coil), for every aa

KVYGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFESNFNTHATNRNTD
-----HHHHHHHHH-------------EEEEE----------------
GSTDYGILQINSRWWCNDGRTPGSKNLCNIPCSALLSSDITASVNCAK
----EEEEEE--------------------------------HHHHHH
KIASGGNGMNAWVAWRNRCKGTDVHAWIRGCRL
HHH-------EEE--------------------
Test Data
- To assess the quality of predictions, we need a gold standard
- This is usually obtained by extracting the secondary structure from high-quality crystal structures in the PDB
- Problem: how to extract the secondary structure from a 3D structure?
- DSSP and STRIDE are two well-known algorithms for automatic secondary structure assignment from a 3D structure
- They consider backbone torsion angles, H-bond patterns, and other parameters that are characteristic of certain secondary structures
- Each algorithm assigns one of three or eight secondary structure classes to each aa of the structure
DSSP
- The core of DSSP is a function that detects H-bonds formed by the protein backbone
- Whether an H-bond exists is decided from the electrostatic energy of each acceptor/donor pair
- Assumption: C=O and H-N are polarized and bear partial charges; C=O carries charges of magnitude $q_1 = 0.42\,e_0$ and N-H charges of magnitude $q_2 = 0.20\,e_0$
- $E = q_1 q_2 f \left( \frac{1}{r_{ON}} + \frac{1}{r_{CH}} - \frac{1}{r_{OH}} - \frac{1}{r_{CN}} \right)$ with $f = 332$ Å kcal/mol
Kabsch, Sander, Biopolymers (1983), 22, 2577
DSSP
- Hydrogen positions for the backbone NH groups are constructed from standard bond lengths and angles (they are not contained in X-ray diffraction data!)
- DSSP assumes an H-bond between two amino acids (i, j) if E_ij is below the threshold t = -0.5 kcal/mol (about -2.1 kJ/mol)
- H-bonds for (i, i+3), (i, i+4), or (i, i+5) are interpreted as 3-, 4-, or 5-turns
- Multiple adjacent turns of the same type correspond to 3_10-, α-, and π-helices
- A β-bridge is assumed if there are H-bonds for
  - (i-1, j) and (j, i+1) [parallel]
  - (i, j) and (j, i) [anti-parallel]
- Multiple adjacent β-bridges of the same type indicate the presence of β-sheets
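The H-bond test can be sketched in a few lines. This is a minimal illustration, not the DSSP program itself: it assumes the published Kabsch-Sander energy with the factor 332 Å kcal/mol and a cutoff of -0.5 kcal/mol, and the function names and the collinear test geometry are made up for the example.

```python
import math

Q1, Q2 = 0.42, 0.20   # partial charge magnitudes in units of e (C=O and N-H)
F = 332.0             # dimensional factor in Å kcal/mol (assumed, Kabsch & Sander)
THRESHOLD = -0.5      # kcal/mol (assumed published cutoff)


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O group and an N-H group.

    Each argument is an (x, y, z) coordinate in Å.
    """
    return Q1 * Q2 * F * (
        1.0 / math.dist(o, n)
        + 1.0 / math.dist(c, h)
        - 1.0 / math.dist(o, h)
        - 1.0 / math.dist(c, n)
    )


def is_hbond(c, o, n, h, threshold=THRESHOLD):
    """True if the energy is below (more negative than) the threshold."""
    return hbond_energy(c, o, n, h) < threshold


# Hypothetical collinear geometry C=O ... H-N with an O...H distance of 1.9 Å
c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

Because the hydrogen position enters the energy, the coordinates for H have to be constructed first when the input comes from an X-ray structure.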
STRIDE
- STRIDE is an improved version of DSSP
  - Improved energy function for H-bonds
  - Includes a dependence on the H-bond angle
  - Different thresholds for helices and sheets
  - Also considers backbone torsion angles
- Often recognizes amino acids at the end of a secondary structure element that DSSP would miss (slightly longer helices/strands)
- STRIDE yields slightly better results than DSSP (95% correct for helices, 93% for strands, relative to manually annotated X-ray structures)
Frishman, Argos, Proteins (1995), 23, 566
STRIDE
- The empirical potential for the H-bond energy contains a distance-dependent contribution (E_r) and directional contributions (E_t, E_p)
- The distance dependence is modeled by an 8-6 potential, $E_r = \frac{C}{r^8} - \frac{D}{r^6}$, where r is the distance between the donor and acceptor atoms (N, O) and C, D are constants derived from average H-bond donor-acceptor distances
- The angle-dependent terms describe the deviation from the ideal bond geometry; they are rather complex and thus left out here (for details see Frishman & Argos, 1995)
Frishman, Argos, Proteins (1995), 23, 566
DSSPcont
- Secondary structure assignment is not unambiguous
  - Structures are flexible
  - Parts of the structure might fluctuate between different secondary structures
- Example: H-bonds at the end of a helix are often very close to the DSSP threshold
- DSSPcont: instead of a fixed assignment, estimate probabilities for each secondary structure
Andersen et al., Structure (2002), 10, 175
DSSPcont
- Apply DSSP, but compute the secondary structure assignment for various thresholds $t \in T = \{-1.0, -0.9, \ldots, -0.1\}$ kcal/mol
- For each threshold t, every aa i of the sequence is assigned a secondary structure class $c = \mathrm{DSSP}(i, t)$ with $c \in C = \{G, H, I, T, E, B, S, L\}$
- Define a binary function: $\mathrm{DSSP}_{it}(c) = 1$ if $\mathrm{DSSP}(i, t) = c$, and $0$ otherwise
- For each sequence position i, $\mathrm{DSSP}_{it}(c)$ defines an 8x10 matrix with the DSSP assignments for all thresholds
Andersen et al., Structure (2002), 10, 175
DSSPcont
- From this matrix, DSSPcont determines the probabilities DSSPcont_i(c) for each position i and class c by scaling with empirically determined weights w_it
- This assigns to each position a vector of probabilities, one for each of the eight secondary structure classes
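The idea of turning threshold-dependent assignments into per-class probabilities can be sketched as a weighted average of the binary indicator over all thresholds. This is only an assumption about the form of the scaling; the actual empirically determined weights are not given here, so the weights in the example are hypothetical.

```python
CLASSES = "GHITEBSL"  # the eight DSSP classes


def dsspcont(assignments, weights):
    """Per-class probabilities for a single residue.

    assignments: one DSSP class letter per threshold (e.g. 10 letters).
    weights:     one non-negative weight per threshold (hypothetical values).
    """
    if len(assignments) != len(weights):
        raise ValueError("need one weight per threshold")
    total = sum(weights)
    probs = {c: 0.0 for c in CLASSES}
    for cls, w in zip(assignments, weights):
        probs[cls] += w / total  # indicator is 1 only for the assigned class
    return probs


# A residue at a helix end: helix for seven thresholds, turn for the other three
probs = dsspcont("HHHHHHHTTT", [1.0] * 10)
print({c: round(p, 2) for c, p in probs.items() if p > 0})
```

A residue in the middle of a stable helix would receive probability 1 for H, whereas residues near segment ends spread their probability over several classes.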
DSSPcont
- Secondary structure variability (in particular at the ends of the helices) across the 23 NMR models of 1CY3 is correctly captured by DSSPcont
- This allows areas of unstable secondary structure to be identified
Andersen et al., Structure (2002), 10, 175
Quality Measures
- Three-state classification (C/H/E: coil/helix/extended)
- Q_3 score: percentage of amino acids assigned correctly under the three-state classification
- The ends of secondary structure elements in particular often cannot be classified unambiguously (cf. thresholding in DSSP!)
- Predictions with an accuracy of 80% or more are thus excellent
[Figure: predicted vs. observed secondary structure]
Quality Measures
- Occasionally eight-state classifications are used (H/E/G/I/T/B/S/L)
  - 3_10-helix (G), α-helix (H), π-helix (I), helix turn (T)
  - strand (E), β-bridge (B)
  - bend (S), other/loop (L)
- Q_8 score: fraction of correctly assigned amino acids
- The eight classes can be mapped back to three:
  - HELIX = 3_10-helix + α-helix + π-helix
  - EXTENDED = strand + β-bridge
  - LOOP = loop + bend + helix turn
- The Q_8 score is generally smaller than the Q_3 score
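Both scores are simple per-residue agreement rates, and the eight-to-three mapping is a lookup table. The following sketch assumes that predicted and observed structures are given as strings of class letters of equal length; the function names are chosen for this example only.

```python
# Mapping from the eight DSSP classes to the three states, as listed above
EIGHT_TO_THREE = {
    "G": "H", "H": "H", "I": "H",   # helix
    "E": "E", "B": "E",             # extended
    "T": "C", "S": "C", "L": "C",   # loop/coil
}


def reduce_to_three(ss8):
    """Translate an eight-state string into a three-state string."""
    return "".join(EIGHT_TO_THREE[c] for c in ss8)


def q_score(predicted, observed):
    """Percentage of positions with identical class (Q3 or Q8, depending on input)."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


observed8 = "LLHHHHGGTSEEEEBL"
predicted8 = "LLHHHHHHTTEEEELL"
print(q_score(predicted8, observed8))                                    # Q8
print(q_score(reduce_to_three(predicted8), reduce_to_three(observed8)))  # Q3
```

In this toy example several Q_8 mismatches (G vs. H, S vs. T) disappear after the reduction, which is why Q_3 is never smaller than Q_8 for the same pair of assignments.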
Segment OVerlap (SOV)
- Measures the overlap between predicted and observed secondary structure, based on the comparison of pairs of segments
- Compare observed (s_b) and predicted (s_v) segments of the same type (H, C, or E)
- 100% for an entirely correct assignment
- minov(s_b, s_v): length of the intersection of s_b and s_v
- maxov(s_b, s_v): length of the union of s_b and s_v
[Figure: minov and maxov illustrated for an observed segment s_b and a predicted segment s_v]
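Segment OVerlap (SOV)
$\mathrm{SOV} = \frac{100}{N} \sum_{t \in \{H, C, E\}} \sum_{(s_v, s_b) \in S(t)} \frac{\mathrm{minov}(s_b, s_v) + \delta(s_b, s_v)}{\mathrm{maxov}(s_b, s_v)} \, |s_b|$
$\delta(s_b, s_v) = \min\left\{ \mathrm{maxov}(s_b, s_v) - \mathrm{minov}(s_b, s_v);\ \mathrm{minov}(s_b, s_v);\ \frac{|s_b|}{2};\ \frac{|s_v|}{2} \right\}$
- |s|: length of segment s
- $t \in \{H, C, E\}$: secondary structure type
- $N = \sum |s_b|$: total length of all segments
- S(t): set of all pairs (s_v, s_b) of overlapping segments of type $t \in \{H, C, E\}$ in the predicted and observed structure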
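A compact sketch of the SOV computation follows. It makes two assumptions that go beyond the slide: the halves in δ are taken as integer halves, and N counts an observed segment once per overlapping partner and once if it has no partner at all (as in the later published SOV definition), which keeps the score at or below 100.

```python
from itertools import groupby


def segments(ss):
    """Split a structure string into (state, start, end) runs; end is exclusive."""
    out, pos = [], 0
    for state, run in groupby(ss):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out


def sov(observed, predicted, states="HEC"):
    """Segment overlap score in percent (sketch, see assumptions above)."""
    obs, pred = segments(observed), segments(predicted)
    total, norm = 0.0, 0
    for t in states:
        for state, ob, oe in obs:
            if state != t:
                continue
            length = oe - ob
            partners = [(pb, pe) for sp, pb, pe in pred
                        if sp == t and pb < oe and ob < pe]
            if not partners:          # observed segment without any overlap
                norm += length
                continue
            for pb, pe in partners:
                minov = min(oe, pe) - max(ob, pb)   # intersection
                maxov = max(oe, pe) - min(ob, pb)   # total extent
                delta = min(maxov - minov, minov, length // 2, (pe - pb) // 2)
                total += (minov + delta) / maxov * length
                norm += length
    return 100.0 * total / norm if norm else 0.0


# The helix is predicted one residue too short at both ends
print(sov("CCHHHHHHCC", "CCCHHHHCCC"))
```

In this example the per-residue accuracy is only 80%, but the δ term forgives the small shift of the segment boundaries, so the SOV stays high.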
Secondary Structure Prediction
Several generations of algorithms:
- 1st generation: consider properties of individual aa only (Q_3 ≈ 50-60%)
- 2nd generation: include the local environment (Q_3 ≈ 65%)
- 3rd generation: include information from homologs (Q_3 > 70%)
- 4th generation: consensus methods combining the results of several other (sub)prediction methods (Q_3 ≈ 75-80%)
Chou-Fasman Algorithm
- Idea: amino acids differ in their affinity for specific secondary structures
- Analysis of structural databases: how often is each aa found in each secondary structure type?
- Let n_j be the number of occurrences of aa j in all proteins of the database
- The probability p_j of aa j occurring in a protein is then $p_j = n_j / \sum_j n_j$
- Similarly, define the probability of finding aa j in secondary structure type k (with $k \in \{C, H, E\}$) as $p_{j,k} = n_{j,k} / \sum_j n_{j,k}$
Chou, Fasman, Biochemistry (1974), 13, 211
Chou-Fasman Algorithm
- Similarly, define the relative frequency f_{j,k} of finding aa j in secondary structure type k: $f_{j,k} = n_{j,k} / n_j$
- The average frequency for any of the 20 aa to be found in secondary structure k can thus be written as $\langle f_k \rangle = \sum_j f_{j,k} / 20 \approx \sum_j n_{j,k} / \sum_j n_j$
- The propensity of aa j for secondary structure k is thus $P_{j,k} = f_{j,k} / \langle f_k \rangle$
- These propensities define the preference of the individual amino acids for a certain secondary structure type
Chou, Fasman, Biochemistry (1974), 13, 211
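The propensities follow directly from a table of counts. The sketch below uses the ratio-of-sums form for the average frequency (total residues in state k divided by total residues); the counts are invented toy numbers, not values from a real database.

```python
def propensities(counts):
    """Compute P[j, k] = f[j, k] / <f_k> from counts[aa][state].

    <f_k> is taken as (all residues in state k) / (all residues).
    """
    states = sorted({k for per_aa in counts.values() for k in per_aa})
    n = {aa: sum(per_aa.values()) for aa, per_aa in counts.items()}
    total = sum(n.values())
    result = {}
    for k in states:
        n_k = sum(per_aa.get(k, 0) for per_aa in counts.values())
        f_avg = n_k / total
        for aa, per_aa in counts.items():
            f_jk = per_aa.get(k, 0) / n[aa]
            result[(aa, k)] = f_jk / f_avg
    return result


# Toy counts for two amino acids (hypothetical numbers)
toy = {
    "Ala": {"H": 6, "E": 1, "C": 3},
    "Gly": {"H": 1, "E": 2, "C": 7},
}
for key, value in sorted(propensities(toy).items()):
    print(key, round(value, 2))
```

Values above 1 mark an amino acid that occurs in a state more often than average, values below 1 one that avoids it.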
Chou-Fasman Algorithm
Divide the 20 aa into several classes according to their P_α:
- Strong helix builders H_α (Glu, Ala, Leu)
- Helix builders h_α (His, Met, Gln, Trp, Val, Phe)
- Weak helix builders I_α (Lys, Ile)
- Indifferent i_α (Asp, Thr, Ser, Arg, Cys)
- Weak helix breakers b_α (Asn, Tyr)
- Strong helix breakers B_α (Pro, Gly)
Similarly for β-strands: H_β, h_β, i_β, b_β, B_β
Chou, Fasman, Biochemistry (1974), 13, 211
Chou-Fasman Parameters

AA    P_α    Class  |  AA    P_β    Class
Glu   1.53   H_α    |  Met   1.67   H_β
Ala   1.45   H_α    |  Val   1.65   H_β
Leu   1.34   H_α    |  Ile   1.60   H_β
His   1.24   h_α    |  Cys   1.30   h_β
Met   1.20   h_α    |  Tyr   1.29   h_β
Gln   1.17   h_α    |  Phe   1.28   h_β
Trp   1.14   h_α    |  Gln   1.23   h_β
Val   1.14   h_α    |  Leu   1.22   h_β
Phe   1.12   h_α    |  Thr   1.20   h_β
Lys   1.07   I_α    |  Trp   1.19   h_β
Ile   1.00   I_α    |  Ala   0.93   I_β
Asp   0.98   i_α    |  Arg   0.90   i_β
Thr   0.82   i_α    |  Gly   0.81   i_β
Ser   0.79   i_α    |  Asp   0.80   i_β
Arg   0.79   i_α    |  Lys   0.74   b_β
Cys   0.77   i_α    |  Ser   0.72   b_β
Asn   0.73   b_α    |  His   0.71   b_β
Tyr   0.61   b_α    |  Asn   0.65   b_β
Pro   0.59   B_α    |  Pro   0.62   b_β
Gly   0.53   B_α    |  Glu   0.26   B_β

Chou, Fasman, Biochemistry (1974), 13, 222
Chou-Fasman Algorithm II
Example: finding the helix core

Residue   T     S     P     T     A     E     L     M     R     S     T     G
Class     i_α   i_α   B_α   i_α   H_α   H_α   h_α   H_α   i_α   i_α   i_α   B_α
Weight    0.5   0.5   -1    0.5   1     1     1     1     0.5   0.5   0.5   -1

- The window T A E L M R has the weight sum 0.5 + 1 + 1 + 1 + 1 + 0.5 = 5: helix start
Chou-Fasman Algorithm II
Example: expanding the core with windows of 4 aa (based on P_α values!)

Residue   T     S     P     T     A     E     L     M     R     S     T     G
P_α       0.8   0.8   0.6   0.8   1.4   1.5   1.2   1.5   1.0   0.8   0.8   0.6

- Expand to the left
  - P T A E: 4.3 / 4 > 1.0, continue
  - S P T A: 3.6 / 4 < 1.0, stop
- Expand to the right
  - L M R S: 4.5 / 4 > 1.0, continue
  - M R S T: 4.1 / 4 > 1.0, continue
  - R S T G: 3.2 / 4 < 1.0, stop
- A similar procedure is then applied for strands
Chou-Fasman Algorithm III
Algorithm (simplified!)
- Assign α/β classes to each aa of the sequence S = s_1 s_2 ... s_k
A: HELICES
- Assign a weight w_i to every aa i with w(H_α) = w(h_α) = 1, w(i_α) = 0.5, w(b_α) = w(B_α) = -1
- Find helix cores: find the first window of length 6 aa with $\sum w_i \geq 4$
- Expand cores to the left and to the right
  - Windows of length 4
  - Shift to the left and right until $\sum P_\alpha(s_i) < 4$
  - Compatible aa of the first window that no longer matches are considered part of the helix (special rule for compatibility)
Chou, Fasman, Biochemistry (1974), 13, 222
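The helix part of the simplified algorithm can be sketched with the weights and rounded P_α values of the worked example (.. T S P T A E L M R S T G ..). The sketch takes the helix as the union of the core and all accepted extension windows and omits the special compatibility rule; both choices are simplifying assumptions.

```python
SEQ = "TSPTAELMRSTG"
# Class weights and rounded P_alpha values as given in the worked example
WEIGHTS = [0.5, 0.5, -1, 0.5, 1, 1, 1, 1, 0.5, 0.5, 0.5, -1]
P_ALPHA = [0.8, 0.8, 0.6, 0.8, 1.4, 1.5, 1.2, 1.5, 1.0, 0.8, 0.8, 0.6]


def find_core(weights, size=6, cutoff=4.0):
    """Return (start, end) of the first window with weight sum >= cutoff."""
    for start in range(len(weights) - size + 1):
        if sum(weights[start:start + size]) >= cutoff:
            return start, start + size   # end is exclusive
    return None


def extend(p, start, end, size=4, cutoff=4.0):
    """Grow [start, end) while the outermost window of `size` sums to >= cutoff."""
    while start > 0 and sum(p[start - 1:start - 1 + size]) >= cutoff:
        start -= 1
    while end < len(p) and sum(p[end - size + 1:end + 1]) >= cutoff:
        end += 1
    return start, end


core = find_core(WEIGHTS)
helix = extend(P_ALPHA, *core)
print(SEQ[core[0]:core[1]])    # -> TAELMR
print(SEQ[helix[0]:helix[1]])  # -> PTAELMRST
```

The same two functions could be reused for strands by passing the β weights and P_β values and a core window of five residues, although the strand core criterion (counting builders and breakers) would need its own test.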
Chou-Fasman Algorithm III
Algorithm (simplified!)
B: STRANDS
- Assign weights w_i with w(H_β) = w(h_β) = 1, w(i_β) = 0.5, w(b_β) = w(B_β) = -1
- Find strand cores: windows of length five with
  - three or more H_β or h_β
  - at most one B_β or b_β
- Expand cores to the left and right
  - Windows of four aa
  - Shift left/right until $\sum P_\beta(s_i) < 4$
Chou, Fasman, Biochemistry (1974), 13, 222
Chou-Fasman Algorithm III
Algorithm (simplified!)
C: CONFLICT RESOLUTION
- For segments marked as both α and β, calculate the averages $P^{avg}_\alpha$ and $P^{avg}_\beta$
  - Helix, if $P^{avg}_\alpha > P^{avg}_\beta$
  - Strand, if $P^{avg}_\alpha < P^{avg}_\beta$
- The complete algorithm contains further rules for assignments at the ends of segments and for conflict resolution
Chou, Fasman, Biochemistry (1974), 13, 222
Chou-Fasman Algorithm
- Online prediction: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1
- Prediction accuracy is rather low (50-60%)
- There is a whole range of improved methods:
  - Including the prediction of turns
  - Improved statistics (Chou, Fasman: 15 proteins!)
- Key problem: neighboring residues have a strong influence and need to be considered beyond a simple averaging
Non-Locality
- The same sequence can produce different secondary structures: Val-Asn-Thr-Phe-Val in 1ECN (80-84) and 9RSA (43-47)
[Figure: the Val-Asn-Thr-Phe-Val segment in the structures 1ECN and 9RSA]
Non-Locality
- Strands show stronger non-locality than helices: interactions between very distant sequence regions are necessary for stabilization
- Helices: interactions only between adjacent turns of the helix (at most 5 aa apart!)
2nd Generation Methods
- Include neighboring residues
  - Drastically improves prediction for helices
  - Strands are still difficult
- Wide range of methods employing all sorts of techniques from statistical learning
  - Artificial neural networks
  - LDFs (linear discriminant functions)
  - Nearest-neighbor classifiers
  - Support vector machines
  - Hidden Markov models
GOR Method
- Garnier-Osguthorpe-Robson method
- Several variants (GOR I to GOR IV)
- Here: GOR IV as an example of a 2nd generation method
- Includes neighboring residues in a wider window
  - Window length in GOR IV: 17 aa
- Common lengths of secondary structure elements:
  - Helices: ca. 5-40 aa
  - Strands: ca. 4-10 aa
GOR IV
- Instead of P_ij there are now three matrices (PSSMs, position-specific scoring matrices), one for each of the classes H, C, E
- A matrix entry corresponds to the probability of finding a certain residue in this environment in a given secondary structure type
[Figure: a scoring matrix (rows Val, Tyr, ..., Cys, Ala) placed over a window of the sequence YGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFES]
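GOR IV
- The matrix entries S^α_ij are estimated from a database of known structures
- The score for position i is then obtained by summation over the whole window
[Figure: a scoring matrix (rows Tyr, ..., Gly, Met) placed over a window of the sequence YGRCELAAAMKRLGLDNYRGYSLGNWVCAAKFES]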
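The window summation can be sketched as follows. The matrices are assumed to be given as one table per class, indexed by window offset (-8 to +8 for a 17 aa window) and residue; the toy matrices in the example are invented and only show the mechanics, and choosing the class with the highest score is the usual decision rule rather than a detail taken from the slide.

```python
def gor_predict(seq, matrices, half_window=8):
    """Predict one class per residue by summing matrix entries over a window.

    matrices[state][offset][aa] is the score contributed by residue `aa`
    at position i + offset towards `state` at position i.
    """
    prediction = []
    for i in range(len(seq)):
        scores = {}
        for state, m in matrices.items():
            total = 0.0
            for off in range(-half_window, half_window + 1):
                j = i + off
                if 0 <= j < len(seq):            # ignore positions outside the sequence
                    total += m[off].get(seq[j], 0.0)
            scores[state] = total
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)


def toy_matrix(favoured, score=1.0):
    """Hypothetical matrix that rewards one residue type at every offset."""
    return {off: {favoured: score} for off in range(-8, 9)}


toy = {"H": toy_matrix("A"), "E": toy_matrix("V"), "C": toy_matrix("G")}
print(gor_predict("AAAAAA", toy))
print(gor_predict("VVVVVV", toy))
```

With real parameters each of the three matrices holds 17 x 20 entries, which is why a large structure database is needed to estimate them reliably.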
GOR IV
- Requires a large data set to determine all matrix elements with sufficient accuracy
- Still leads to ambiguities, in particular at the ends of secondary structure elements
- Prediction quality: Q_3 ≈ 64%
- Available online at http://abs.cit.nih.gov/
- There are further, slightly improved versions (e.g. GOR V)
Third-Generation Methods
- Only about 65% of the information required is local: 1st/2nd generation methods cannot get much better
- Observation: about 67% of the residues of a sequence can be exchanged without breaking the secondary structure
  - Evolution has tried many of these neutral mutations
  - Evolutionarily related (homologous) sequences contain this information
  - If there are helix breakers at the same position in homologous sequences, a helix is unlikely there
- This type of information is easily integrated through sequence profiles
PHD
- PHD uses
  - Artificial neural networks (ANNs) for classification
  - Profiles of homologous sequences
- Three-layered ANN
  - 1st + 2nd layer: mapping of the sequence/profile onto secondary structure classes
  - 3rd layer: majority vote on the results of the previous layers
Rost, Sander, J. Mol. Biol. (1993), 232, 584
Recap: ANNs
- A graph defines the topology
  - Arranged in layers
  - Weighted edges
- Each node forms a weighted sum of its input signals and applies a (nonlinear) activation function f
- Popular choice: f = logistic function
[Figure: a node with inputs I_1, I_2, I_3, weights w_1, w_2, w_3, and output Σ/f]
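A single node of such a network fits in a few lines. The bias term is an addition that is common in practice but not shown on the slide; it defaults to zero here.

```python
import math


def logistic(x):
    """Logistic activation function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the logistic activation."""
    return logistic(sum(w * i for w, i in zip(weights, inputs)) + bias)


print(neuron([1.0, 0.0, 1.0], [0.8, -0.5, 1.2]))
```

A layer is a list of such nodes sharing the same inputs, and training (for example by back propagation) adjusts the weights.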
PHD
Topology of the ANN
[Figure: a window of the query sequence (.. K E L N D L E K K Y N A H I G ..) together with the aligned homologous sequences enters the 1st layer (sequence-to-structure), followed by the 2nd layer (structure-to-structure) and the 3rd layer (jury decision); the example outputs are 2.46, 0.37, and 1.26, giving the decision: helix]
After: Rost, Sander, J. Mol. Biol. (1993), 232, 584
PHD
- A post-processing step then removes secondary structure elements shorter than three aa
- The ANN is trained on DSSP-annotated X-ray structures
- Results:
  - Using profiles instead of single sequences improves Q_3 by about 6%; majority votes add another 2%
  - The improved version PHD3 raises Q_3 to about 75%
PSIPRED I
- Three-step algorithm
  - Construction of a profile
  - Prediction with a two-layer ANN
  - Filtering of predictions
- Profile generation
  - PSI-BLAST run (three iterations) of the sequence against a large, non-redundant protein sequence database
  - The PSI-BLAST profile (scoring matrix) serves as input to the first layer of the ANN
Jones, J. Mol. Biol. (1999), 292, 195
PSIPRED II
- A window of 15 rows of the profile is used as input to the first layer
- The 15 x 3 outputs of the first layer are connected to the second layer, which recognizes neighboring residues of similar secondary structure (segment filtering)
- The 2nd layer produces the final classification
[Figure: profile with the columns A C D E F G H I K L M N P Q R S T V W Y -; first network: 15x21 inputs, 75 hidden nodes, 3 outputs; second network: 60 inputs, 60 hidden nodes, 3 outputs]
Jones, J. Mol. Biol. (1999), 292, 195
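How a 15-row window turns into 15 x 21 network inputs can be sketched as below. The sketch assumes that the 21st column (shown as "-" in the profile) flags window positions that fall outside the sequence; the profile values in the example are placeholders, not PSI-BLAST output.

```python
def profile_windows(profile, size=15):
    """Yield one flat input vector per residue.

    profile: list of rows, one per residue, each holding 20 scores.
    Every window position contributes the 20 scores plus one flag that is
    1.0 when the position lies beyond the sequence ends (assumed meaning
    of the extra column).
    """
    half = size // 2
    n_cols = len(profile[0])
    for i in range(len(profile)):
        features = []
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(profile):
                features.extend(list(profile[j]) + [0.0])
            else:
                features.extend([0.0] * n_cols + [1.0])
        yield features


# Placeholder profile for a 30-residue sequence
profile = [[0.1] * 20 for _ in range(30)]
windows = list(profile_windows(profile))
print(len(windows), len(windows[0]))
```

Each vector has 15 x 21 = 315 entries, matching the input size of the first network.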
PSIPRED III
- The ANN is trained through back propagation
- The 2nd layer removes very short secondary structure elements
- Results:
  - PSIPRED is one of the best prediction algorithms currently available
  - Q_3 ≈ 77%; improved versions: Q_3 ≈ 81%
- Online server: http://www.psipred.net
Jones, J. Mol. Biol. (1999), 292, 195
sspro
- Uses bidirectional recurrent neural networks (BRNNs)
- Window size of 41 aa
- Evolutionary information from multiple alignments
- Q_3 ≈ 76%
Baldi, Brunak, Frasconi, Soda, Pollastri, Bioinformatics (1999), 15, 937
Consensus Methods
- JPRED meta server: uses six independent methods in parallel
  - NNSSP (a variant of SSP)
  - PHD
  - MULPRED (multiple predictions including GOR, Chou & Fasman)
  - ZPRED
  - PREDATOR
  - DSC
- Majority vote for each amino acid
- If there is no clear winner: use the result of PHD!
- Accuracy: 73% (1% better than PHD)
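The voting scheme can be sketched as a per-position majority vote with a fallback. Interpreting "no clear winner" as a tie for the highest vote count is an assumption, and the predictions in the example are invented strings rather than output of the named programs.

```python
from collections import Counter


def consensus(predictions, fallback="PHD"):
    """Per-residue majority vote over several predictions.

    predictions: dict mapping method name -> predicted string (equal lengths).
    On a tie for the top vote count, the fallback method decides.
    """
    length = len(predictions[fallback])
    result = []
    for i in range(length):
        votes = Counter(p[i] for p in predictions.values()).most_common()
        if len(votes) > 1 and votes[0][1] == votes[1][1]:
            result.append(predictions[fallback][i])   # no clear winner
        else:
            result.append(votes[0][0])
    return "".join(result)


example = {
    "PHD": "HHC",
    "DSC": "HEC",
    "ZPRED": "HEE",
    "PREDATOR": "HCE",
}
print(consensus(example))  # -> HEC
```

In the example the first position is unanimous, the second is decided by a 2:1:1 vote, and the third is a 2:2 tie that falls back to PHD.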
CASP5 Results
- CASP (Critical Assessment of Structure Prediction): a blind prediction competition
- Meta servers come out on top
- The top 10 achieve an SOV of about 80% (CASP4, 2000: 76%)
- Successful meta servers are based on sspro, PSIPRED, and/or SAM-T02 (HMM approach)
- Helix predictions are still about 10% better than those for strands
Aloy et al., Proteins: Structure, Function, Genetics (2003), 53, 436
CASP5 Secondary Structure
[Figure: CASP5 secondary structure prediction results]
Aloy et al., Proteins: Structure, Function, Genetics (2003), 53, 436
Summary
- Secondary structure prediction is a first step in tertiary structure prediction
- Successful methods consider large sequence stretches and evolutionary information alike
- Meta servers yield slightly superior results
- Prediction accuracies (Q_3) of 75-80% are possible
References
- Burkhard Rost: Prediction in 1D. In: Structural Bioinformatics (eds.: P. E. Bourne, H. Weissig), Wiley, 2003
- Ralf Zimmer, Thomas Lengauer: Structure Prediction, Chapter 5 in T. Lengauer (ed.): Bioinformatics: From Genomes to Drugs, Wiley, 2002