Protein Structure Prediction
Russ B. Altman
BMI 214 / CS 274

Protein folding is different from structure prediction
-- Folding is concerned with the process of taking on the 3D shape, usually based on physical principles.
-- Prediction uses any statistical, theoretical, or empirical data to try to get at the end result.

Protein Structure Prediction
1. A bit of history: Asilomar, 1994, 1996, 1998 & 2000.
2. Four approaches to structure prediction:
   a. Homology modeling
   b. Ab initio prediction
   c. Sequence-structure threading
   d. Docking
3. Two ways of threading:
   -- Dynamic programming
   -- Knowledge-based potentials

Asilomar, 1994, 1996, 1998 & 2000
1. Asilomar is a state conference ground near Carmel and Monterey.
2. December 1994: Meeting on Critical Assessment of Techniques for Protein Structure Prediction (CASP).
3. December 1996 & 1998: second and third meetings, etc.
4. A competition was held to compare and contrast methods.

Asilomar
The competition worked like this:
-- Experimentalists who had a structure that would be solved before the date of the CASP meeting submitted the sequence of the unknown to a central repository.
-- Predictors could download the sequence and minimal information about the protein (name), and could enter one of three categories.
-- Assessors used automatic programs for analysis, in addition to their own expertise, to evaluate the quality of the predictions.

Asilomar Categories
1. Homology modeling (sequences with high homology to sequences of known structure)
-- Given a sequence with homology > 25-30% to a known structure in the PDB, use the known structure as the starting point to create a model of the 3D structure of the sequence.
-- Takes advantage of knowledge of a closely related protein.
-- Uses sequence alignment techniques to establish correspondences between the known template and the unknown.

Asilomar Categories
2. Ab initio prediction (no known homology with any sequence of known structure)
-- Given only the sequence, predict the 3D structure from first principles, based on energetic or statistical principles.
-- Secondary structure prediction and multiple alignment techniques are used to predict features of these molecules.
-- Then some method is necessary for assembling the 3D structure.

Ab initio prediction
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Predict secondary structure:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
HHHHHCCCCCHHHHHHHHHHCCCCBBBBBBBCCBBBB
Predict 3D structure entirely:
[Figure: Comparison of calculated (red) and experimental (blue) structures for the protein myoglobin using the refined potential function. The calculated structure is the lowest-energy structure obtained from 3 different jobs with clustering and energy selection. The total simulation time on a 16-node partition of a CM-5 massively parallel computer was 60 hours, in which about 5 billion structures were generated. The RMS deviation of the two structures is 6.2 Å.]

Asilomar Categories
3. Fold recognition (sequences with no sequence identity (<= 30%) to sequences of known structure)
-- Given the sequence and a set of folds observed in the PDB, see whether the sequence could adopt one of the known folds.
-- Takes advantage of knowledge of existing structures and of the principles by which they are stabilized (favorable interactions).

Fold Recognition
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Library of known folds:
[Figure: the new sequence compared against a library of known folds]

Asilomar Categories
4. Docking two proteins ('96 only)
-- Given two separate (known) protein structures, predict the geometry of their physical association.
-- Use information about surface properties to find the best hand/glove or lock/key fit between the two known structures.
-- Can be done by rigid-body docking or flexible docking (harder).
[Figure: Protein Docking]

How to evaluate predictions?
-- RMSD
-- Overall identification and topology of secondary structures
-- Energy considerations (contacts, H-bonds)
-- Similarity of hydrophobic core
-- Sequence alignment quality (and systematic shift)
See review of CASP4 at http://www3.interscience.wiley.com/cgi-bin/issuetoc?type=dd&id=90010623

Homology Modeling
-- When sequence homology is > 70%, high-resolution models are possible (< 3 Å RMSD).
-- Sophisticated energy minimization techniques do not dramatically improve upon the initial guess.

Sample Homology Modeling: MODELLER (Sali et al., see course web page)
1. Find homologous proteins with known structure and align them.
2. Collect distance distributions between atoms in the known protein structures.
3. Use these distributions to compute positions for equivalent atoms in the alignment.
4. Refine using energetics.
Rigorous criteria are applied, such as torsion angles, van der Waals violations, and RMSD.
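
RMSD appears above both as an evaluation criterion and as a model quality threshold. As a minimal sketch of what the number measures, the function below computes the root-mean-square deviation between two sets of equivalent atoms. It assumes the two structures have already been superimposed and that atoms are paired by list position; the function name and the toy coordinates are illustrative, not taken from the course material.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both structures are already superimposed and that
    coords_a[i] and coords_b[i] are equivalent atoms.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Toy coordinates (hypothetical): the "model" is the "experiment" shifted 3 Å in x.
experiment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 3.0, y, z) for (x, y, z) in experiment]
print(rmsd(experiment, model))
```

In practice an optimal superposition is found first, so the reported RMSD is the minimum over rigid-body placements; that step is omitted here.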

[Figure: Homology modeling sample. Thick backbone shows the known structure. Thin lines show modeled structures. Some sidechains are not positioned correctly, but the backbone and other sidechains look quite good.]

[Figure panels:
a. Sidechain mistakes
b. Shifts with correct alignment
c. No template
d. Misalignment
e. Incorrect template]

-- Use of sensitive multiple alignment techniques (e.g. PSI-BLAST) helped get the best alignments.
-- Sidechain modeling uses libraries of known amino acid conformations. Success ranged from 45% to 80% correct (= angles within 30° of the experimental structure).
-- Energy-based refinement is still not improving the structures.

PSI-BLAST
Extension of BLAST with extra features:
1. Multiple blocks aligned (not just 1).
2. Profile used iteratively to increase sensitivity in picking up distant sequences:
   -- build a profile based on the initial hits
   -- use the profile to conduct another search
   -- rebuild the profile
   -- repeat
3. Be careful about repeating too many times.
[Figure: PSI-BLAST drift]

[Figure: PSI-BLAST overview]

SKIP FOLD RECOGNITION AND COME BACK TO IT

Ab Initio Predictions
1° to 2°: (Secondary structure prediction)
-- Accuracy ranges from 66% to 77% (3-state labeling: helix, coil, or beta).
-- Human hand editing improves the accuracy.
-- Multiple sequence alignments improve the performance of secondary structure prediction.

Ab Initio Predictions
2° to 3°: (Assemble secondary structures into 3D)
-- Sensitive to errors in secondary structure.
-- Predictors were more likely to predict previously known structures.

Ab Initio Predictions
1° to 3°: (Predict 3D from sequence only)
-- Predict inter-residue contacts and then compute the structure (mild success).
-- Simplified energy term + reduced search space (phi/psi or lattice) (moderate success).
-- Creative ways to memorize sequence <-> structure correlations in short segments from the PDB, and use these to model new structures: the ROSETTA method.

Ab Initio Predictions
1° to 3°: Good progress (3 models better than fold recognition results in CASP3)
1. Associate the sequence of the unknown with a library of known 3D structures, and then optimize the contact frequency of amino acids, as measured in the PDB (Baker et al.).
2. Generate all folds on a lattice and then filter the bad ones out (Samudrala et al.).
3. Combine multiple sequence alignment, secondary structure prediction, and a lattice (Skolnick et al.).

[Figure: Lattice search]

Rosetta Method for ab initio
1. Break the target into fragments of 9 amino acids.
2. Create a profile, X, for the target.
3. Create a profile, S, for similar PDB sequences.
4. Align profiles X and S to get a rank-ordered list of the best-matching fragments in the PDB.
(REF: Simons, Baker, JMB 306: 1191-1199)

Rosetta Method for ab initio
5. Start with an extended chain, and evaluate the effect of introducing the fragments into the chain.
6. Use a Metropolis-type algorithm for optimization, using the following terms:
   -- hydrophobic burial
   -- polar side-chain interactions
   -- hydrogen bonding between beta-strands
   -- hard sphere repulsion (van der Waals)
7. Create 1000 structures and cluster them.
8. Choose one representative from each cluster as a possible prediction.
Use an ellipsoid to be sure that hydrophobic residues are central.
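
The "Metropolis-type algorithm" named in the Rosetta steps can be illustrated with a minimal sketch. The acceptance rule below is the standard Metropolis criterion (always accept a move that lowers the energy, otherwise accept with probability exp(-dE/T)). Everything else is an assumption made for illustration: the toy energy function and the toy move stand in for Rosetta's scoring terms and fragment insertions, and none of the names come from Rosetta itself.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Standard Metropolis criterion for a proposed energy change delta_e."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def metropolis_minimize(energy, propose, state, steps, temperature, seed=0):
    """Run a Metropolis search and return the lowest-energy state seen."""
    rng = random.Random(seed)
    current, current_e = state, energy(state)
    best, best_e = current, current_e
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_e = energy(candidate)
        if metropolis_accept(candidate_e - current_e, temperature, rng):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


def toy_energy(state):
    # Hypothetical stand-in for a scoring function: lower is better.
    return float(sum(x * x for x in state))


def toy_move(state, rng):
    # Hypothetical stand-in for a fragment insertion: perturb one position.
    new_state = list(state)
    i = rng.randrange(len(new_state))
    new_state[i] += rng.choice((-1, 1))
    return new_state


start = [5, -4, 3]
best, best_e = metropolis_minimize(toy_energy, toy_move, start, steps=2000, temperature=1.0)
print(best, best_e)
```

Accepting some uphill moves is what lets the search escape local minima; running many independent searches and clustering the results, as in the steps above, is a separate layer on top of this loop.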

[Figure: CASP4 performance]
[Figure: Performance of the Rosetta method]

Alexey Murzin (Proteins, Volume 45, Issue S5, 2001, pages 76-85)
"In 1996, in CASP2, we presented a semimanual approach to the prediction of protein structure that was aimed at the recognition of probable distant homology, where it existed, between a given target protein and a protein of known structure (Murzin and Bateman, [Proteins 1997; Suppl 1:105-112]). Central to our method was the knowledge of all known structural and probable evolutionary relationships among proteins of known structure classified in the SCOP database (Murzin et al., J Mol Biol 1995;247:536-540). It was demonstrated that a knowledge-based approach could compete successfully with the best computational methods of the time in the correct recognition of the target protein fold."

[Figure: Murzin prediction, CASP4 (experimental vs. predicted)]

The computational community responds: "Alexey can't play!"

Fold Recognition (check whether the sequence matches a known 3D fold)
-- CASP1: Of 21 target proteins, 11 wound up having folds that were previously known.
-- CASP2: Of 22 targets, 15 with available folds.
-- CASP3: Of 43 targets, 36 with available folds.
-- CASP4: Of 56 target domains, hard to say.
-- Every predictor does well on something.
-- Common folds (more examples) are easier to recognize.
-- Fold recognition was the surprise performer at the first competition. Incremental progress at the second, third, and fourth.

Fold Recognition
-- Not all or none. The list of top N hits is much better than the top hit alone.
-- Common folds are easier to recognize.
-- The quality of the alignments that result is NOT good.
-- Potentials include: residue pair contact terms, hydrophobicity, polarity, H-bonds, local structure terms.
-- Simple dynamic programming with environmental matching sometimes performs as well as sophisticated 3D potentials...

Fold Recognition
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Library of known folds:
[Figures: N-1 = target, N-2 = fold in PDB]

Fold Recognition ~ Threading ~ Inverse Folding
-- Fold recognition: given a sequence and a library of backbones, find the backbone that accommodates the sequence best.
-- Threading: given a backbone, find the best way to mount the sequence on the backbone (with gaps) to maximize good interactions.
-- Inverse folding: (folding = sequence to 3D). Start with 3D and find a good sequence.

[Figure: Predictors for CASP1 are along the top row. Target sequences are along the first column. Dark grey means bad prediction, light grey pretty good, white very good. Hatched means no prediction. The upper left corner shows the rank of the best answer among the list submitted by predictors (also shows the fold used to make the prediction, the shift error, and the general protein class).]

Elements of a fold recognition algorithm
1. Library of protein structures, suitably processed
   -- All structures
   -- Representative subset
   -- Structures with loops removed
2. Scoring function
   -- contact potential
   -- environmental evaluation function
3. Method for generating initial alignments and/or searching for better alignments.

Dynamic Programming with Environmental Strings
(The subject of one of the homeworks)
IDEA: Instead of aligning a sequence to a sequence, align a sequence to a string of descriptors that describe the 3D environment of the target structure.

Usual DP: the score matrix relates two amino acids:

      A    R    N    D    C    Q
A     2   -2    0    0   -2    0
R    -2    6    0   -1   -4    1
N     0    0    2    2   -4    1
D     0   -1    2    4   -5    2
C    -2   -4   -4   -5   12   -5
Q     0    1    1    2   -5    4

Thread DP: relate amino acids to environments in the 3D structure:

       E1      E2      E3      E4      E5  ...
A    -0.77    1.05   -0.54   -0.65   -1.52
R    -1.80   -1.52   -2.35   -0.11   -0.41
N    -1.76   -2.18   -2.61   -0.48   -0.26
D    -2.48   -1.80   -2.63   -0.80   -2.08
C    -0.43   -0.45   -0.59    0.15   -0.72
Q    -1.38   -2.03   -0.84    0.16   -0.79

What are environments? How do you compute them?
Conceptually, superimpose multiple structures and look at the statistically conserved features around each 3D xyz position. These may include:
-- Is the amino acid buried, partially buried, or exposed?
-- If buried, how polar is the environment?
-- If partially buried, how polar?
-- What kind of secondary structure?
(Buried status, polarity, and secondary structure)

1. Align proteins with similar 3D structure.
2. Align homologous proteins by sequence alone.
3. For each position in the protein, identify its environment by computing the local properties of interest (e.g. secondary structure, burial, polarity).
4. Count the frequencies of the different amino acids (within the multiple alignment) in the different environments. This creates a MATCH MATRIX.
Bowie et al. define 18 environments.

[Figure: Another example of position-specific scores.]
[Figure: DP threading match matrix. Sample matrix showing the alignment of amino acids and environments for globins. Entries indicate the possible score for each amino acid at each environmental position, taken from the match matrix.]
[Figure: Z-scores of DP threading for myoglobins, globins, and non-globins.]

How do you thread a new sequence?
-- Using standard dynamic programming, use the new score matrix to align the sequence of environments from the structure of interest to the sequence of amino acids from the unknown sequence.
-- The highest-scoring alignment is the best superposition of the sequence onto the structure.
-- Using knowledge of the scores of sequences with known structure, one can see whether the score is high enough to put the new sequence in the family.
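
As a minimal sketch of the threading step just described, the function below runs a standard global (Needleman-Wunsch style) dynamic program in which each cell scores an amino acid against an environment class instead of against another amino acid. The match values are the six-by-five excerpt shown earlier in these notes; the linear gap penalty of -1.0, the function name, and the example inputs are assumptions for illustration only.

```python
# Amino acid vs. environment scores (excerpt from the table in these notes).
MATCH = {
    "A": {"E1": -0.77, "E2": 1.05, "E3": -0.54, "E4": -0.65, "E5": -1.52},
    "R": {"E1": -1.80, "E2": -1.52, "E3": -2.35, "E4": -0.11, "E5": -0.41},
    "N": {"E1": -1.76, "E2": -2.18, "E3": -2.61, "E4": -0.48, "E5": -0.26},
    "D": {"E1": -2.48, "E2": -1.80, "E3": -2.63, "E4": -0.80, "E5": -2.08},
    "C": {"E1": -0.43, "E2": -0.45, "E3": -0.59, "E4": 0.15, "E5": -0.72},
    "Q": {"E1": -1.38, "E2": -2.03, "E3": -0.84, "E4": 0.16, "E5": -0.79},
}


def thread_score(sequence, environments, match, gap=-1.0):
    """Best global alignment score of a sequence against an environment string.

    gap is an assumed linear penalty applied to every unaligned residue
    or unaligned structural position.
    """
    n, m = len(sequence), len(environments)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match[sequence[i - 1]][environments[j - 1]]
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair,  # residue i mounted on position j
                dp[i - 1][j] + gap,       # residue i left unaligned
                dp[i][j - 1] + gap,       # position j left empty
            )
    return dp[n][m]


# Hypothetical environment string for a five-position structure.
print(thread_score("ACQRN", ["E2", "E4", "E4", "E5", "E5"], MATCH))
```

To decide whether a score is "high enough", the same function would be run on sequences of known structure so that the new score can be placed against that background distribution.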

Advantages: DP Threading
1. Environmental proclivities may be more accurate than simple amino acid similarity:
   -- structural information
   -- local context
   -- potentially, many other features
2. Fast.
3. Pretty good performance (at Asilomar even).

Net result: sample alignment
B1 E2α B2α B2α E2α B2β P2β Eα Eβ Eα ..
His Asp Val Ile Lys Ile Tyr Ser ..

Disadvantages: DP Threading
-- Requires previous examples to work.
-- The resulting match usually needs refinement.
-- May share some problems of DP in general (independence assumption from column to column, gap penalty choice, etc.).
-- Assumes average amino acid preferences over all similar protein-family environments. Doesn't compute the actual environment created by mounting the sequence on the structure.
-- Assumes that the environment is relatively constant, and that only amino acid details change. But there could be different types of interactions...

Contact Potential Threading
IDEA: Instead of modeling energies from first physical principles, simplify the problem by positioning only amino acids, and compute empirical energies from the observed associations of amino acids.
GLU is attracted to LYS = E(glu, lys)

Contact potential threading
Create energy terms between amino acids:
E(interaction) = -KT ln[frequency of interaction]
where K is a constant, T is the temperature (constant), and the frequency of interaction is measured in a database of known structures.
More frequent -> more favorable.
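
A minimal sketch of the relation E(interaction) = -KT ln[frequency of interaction] follows. The pair counts are hypothetical numbers invented for the example, KT is set to 1 so energies come out in units of KT, and the function name is illustrative. Only the ordering of the resulting energies matters here: a pair seen more often gets a lower (more favorable) energy.

```python
import math


def contact_energies(pair_counts, kT=1.0):
    """Convert observed pair counts to energies via E = -kT * ln(frequency)."""
    total = sum(pair_counts.values())
    return {
        pair: -kT * math.log(count / total)
        for pair, count in pair_counts.items()
        if count > 0  # ln is undefined for pairs that were never observed
    }


# Hypothetical contact counts from a database of known structures.
counts = {("GLU", "LYS"): 300, ("GLU", "GLU"): 50, ("LEU", "LEU"): 650}
energies = contact_energies(counts)
for pair, e in sorted(energies.items(), key=lambda item: item[1]):
    print(pair, round(e, 3))
```

With these made-up counts, GLU-LYS comes out more favorable than GLU-GLU, matching the "GLU is attracted to LYS" intuition above.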

Contact potential (after Sippl et al.)
More specifically:
a = amino acid type a (ALA, VAL, etc.)
b = amino acid type b
s = separation in sequence

$\Delta E^{ab}_{s}(r) = E^{ab}_{s}(r) - E_{s}(r)$

The energy of interaction between a and b, minus the average energy at that separation, equals the energy difference that contributes to stability.

Contact Potential

$\Delta E^{ab}_{s}(r) = -KT \ln \left[ f^{ab}_{s}(r) / f_{s}(r) \right]$

For any given sequence in 3D, compute the distances between all pairs of amino acids (usually up to r = 10-15 Å), and sum:

$E_{tot} = \sum_{\text{all } a,b \text{ pairs}} \Delta E^{ab}_{s}(r)$

Using contact potential
1. Given a 3D structure, need to mount the sequence on the structure:
   -- simple dynamic programming (misses the point)
   -- other dynamic programming (better)
   -- exhaustive enumeration (too expensive); a recent paper shows that this is NP-hard
   -- heuristic enumeration
   -- limit on gap lengths, loop lengths (heuristic)
2. Evaluate the contact potential for the alignment.
3. {Optional} Locally optimize the potential score.
4. Compare the potential with a random shuffle of the sequence, and with other sequences, to approximate a z-score.

Z-score: the number of standard deviations away from the mean. Most meaningful for normal distributions...
[Figure: score distribution with the mean, 2 SD, and the sample threading marked]

Other uses of contact potentials
-- Fold recognition (as discussed here)
-- Incorrect fold recognition: detect unlikely or wrong structures, bad predictions, bad contacts, etc.
-- Measure protein stability
-- Use for ab initio prediction...
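
The evaluation and shuffle-comparison steps under "Using contact potential" can be sketched as follows. This is a deliberately simplified stand-in for the potential described above: instead of a distance- and separation-dependent energy, it uses a single energy per amino acid pair, counted whenever two residues lie within a cutoff distance. The energy table, coordinates, cutoff, and function names are all hypothetical.

```python
import math
import random
import statistics


def total_contact_energy(sequence, coords, pair_energy, cutoff=10.0):
    """Sum pair energies over all residue pairs closer than cutoff (in Å)."""
    total = 0.0
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                key = tuple(sorted((sequence[i], sequence[j])))
                total += pair_energy.get(key, 0.0)
    return total


def shuffle_z_score(sequence, coords, pair_energy, n_shuffles=200, seed=0):
    """Z-score of the threaded sequence against shuffled versions of itself."""
    rng = random.Random(seed)
    native = total_contact_energy(sequence, coords, pair_energy)
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(total_contact_energy(residues, coords, pair_energy))
    spread = statistics.pstdev(scores)
    if spread == 0:
        return 0.0  # every shuffle scored the same, so no z-score is defined
    return (native - statistics.mean(scores)) / spread


# Hypothetical one-letter pair energies and positions along a line, 6 Å apart.
pair_energy = {("E", "K"): -1.0, ("E", "E"): 0.8, ("K", "K"): 0.8}
coords = [(6.0 * i, 0.0, 0.0) for i in range(6)]
print(shuffle_z_score("EKEKEK", coords, pair_energy))
```

Because lower energy is more favorable, a strongly negative z-score is the signal that the sequence fits the structure better than its shuffled versions do.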

Conclusions
1. Protein fold recognition will get asymptotically better as we get more folds.
2. The best ab initio methods use knowledge of the database, and will thus also improve.
3. Estimates are that we now have between 30% and 50% of the folds that occur.
4. Given a fold, we need to improve refinement with homology modeling techniques.

Other information
1. http://predictioncenter.llnl.gov/ points to CASP results and targets.
2. Special journal issues devoted to CASP:
   -- Proteins 23(3), 1995
   -- CASP2: Proteins Supplement 1, 1997
   -- CASP3: Nature Structural Biology, Vol 6, No. 2, Feb 1999, page 108.
   -- CASP4: Proteins Vol 45 (S5), 2001.