Can protein model accuracy be identified? NO! Morten Nielsen, CBS, BioCentrum, DTU

Identification of protein model accuracy
  Why is it important?
  What is accuracy?
    RMSD, fraction correct
  Protein model correctness/quality
    Procheck, Whatif, ProsaII, Verify3d
  Prediction of protein model accuracy
    ProQ server

Why is it so important?
  Reliable fold recognition
    P-value, E-value, Z-score
    Tells you if you should believe in the fold!!
  Alignment (model construction)
    No obvious method to estimate reliability of alignment
      Number of gaps, length of gaps
      Amino acids in protein core and loops
    % id is too conservative
      Many low homology models are accurate, and some high homology models are wrong
  Correct fold, wrong alignment => Terrible model
  How to gain confidence in a protein model?

Model accuracy. Swiss-model. 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

What is protein model accuracy?
  Model quality (correctness)
    Does the model look like a protein?
      Hydrophobic residues in core, hydrophilic on surface
      Backbone geometry (phi/psi angles, bond lengths)
      Amino acid environment
    A correct model (one that looks like a protein) can still be completely wrong
  Accuracy (if we know the answer)
    RMSD
    Fraction of correctly modeled residues

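Model accuracy
  RMSD = $\sqrt{\frac{1}{n} \sum (d_{ij})^2}$
  Fraction correct = $N_c / N$, where $N_c$ = number correct
  Figure: model (blue) superimposed on structure (yellow); $d_{ij}$ marks the distance between a model atom and the corresponding structure atom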
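The two accuracy measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the model and structure coordinates are already superimposed and paired residue by residue. The function names and the 3.0 distance cutoff for calling a residue "correct" are assumptions made for the example; the slide does not state a cutoff.

```python
import math

def rmsd(model, structure):
    """RMSD over paired, already superimposed coordinates: sqrt(1/n * sum(d^2))."""
    if len(model) != len(structure) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(model, structure))
    return math.sqrt(total / len(model))

def fraction_correct(model, structure, cutoff=3.0):
    """N_c / N, counting a residue as correct if it lies within `cutoff`.

    The cutoff value is an assumption for illustration only.
    """
    n_correct = sum(1 for a, b in zip(model, structure) if math.dist(a, b) <= cutoff)
    return n_correct / len(model)

# Toy coordinates (invented): three residues match exactly, one is 4.0 away.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0)]

print(rmsd(model, structure))              # 2.0
print(fraction_correct(model, structure))  # 0.75
```

The toy case shows why the two numbers are complementary: a single badly placed residue dominates the RMSD, while the fraction correct still reports that three of four residues are fine.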

Evaluation of model quality
  Check for proper protein stereochemistry
    ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery)
      Ramachandran plot, bond lengths
    Whatif (http://www.cmbi.kun.nl/gv/servers/wiwwwi)
      Packing quality
    Both are web servers
  Fitness of sequence to structure
    ProsaII (http://lore.came.sbg.ac.at/services/prosa.html)
      Program runs on Linux and Unix
    Verify3D (http://www.doe-mbi.ucla.edu/services/verify_3d/)
      Web server

Amino acid environment
  1,000,000 different protein sequences
  10,000 different solved protein structures
  600 different protein folds
  Typical amino acid environment
  Sequence space large, structure space small

Peptide backbone geometry
  Dihedral angles φ, ψ between peptide planes (figure labels: Cα, N, C, Cα)
  α helix: φ, ψ = -60 degrees
  β strand: φ, ψ = 180 degrees
  From speedy.st-and.ac.uk/.../lectures/3014/lecture/dars1.htm
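A backbone dihedral angle can be computed from four consecutive atom positions. The sketch below is a generic dihedral calculation, not code taken from ProCheck or any other tool named in these slides. It assumes the usual definitions: φ uses C(i-1), N(i), Cα(i), C(i) and ψ uses N(i), Cα(i), C(i), N(i+1). The helper names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four consecutive atoms."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Toy geometry (invented): atoms placed so the two planes are perpendicular.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(round(angle, 6))  # 90.0
```

Applied to every residue of a model, the resulting (φ, ψ) pairs are the points plotted in a Ramachandran plot.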

Ramachandran plot
  Regions
    A. Right handed helix
    B. Beta strand
    L. Left handed helix
  Color coding
    White. Disallowed
    Red. Most favorable
    Yellow. Allowed region
  Glycines are shown as triangles

Wrong structure 1RIP. Ribosomal protein. NMR structure in PDB database 17 Aug, 1993

Procheck. Bond length

What-if. Fine packing quality
  Statistical description of local chemical environment in high quality protein structures
  Superimpose tryptophans and find average local environment. Same for other amino acids
  Full atom model
  G. Vriend and C. Sander, 1992

Example. Casp model T0133
  Casp5 target
  Modeled by X3M (Lund, O., 2002)
  RMSD=7.3

Casp model - Fine packing quality
  ---Residue-----  State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
  -------------------------------------------------------------------------
    1 ILE (  33 )    2    -0.737  -0.462   0.331  -1.312  -0.865
    2 SER (  34 )    2    -0.241   0.209  -0.021  -1.437  -1.421
  ..
  245 ALA ( 296 )    2    -1.919  -1.770  -1.264   0.000   0.000
  246 GLU ( 297 )    3    -1.384  -0.641  -1.400   0.070  -1.132
  247 HIS ( 298 )    3    -1.476  -1.211  -1.736  -0.874  -1.427
  ============================================================
  All contacts   : Average = -0.459  Z-score = -3.05
  BB-BB contacts : Average = -0.155  Z-score = -1.14
  BB-SC contacts : Average = -0.445  Z-score = -2.94
  SC-BB contacts : Average = -0.221  Z-score = -1.39
  SC-SC contacts : Average = -0.701  Z-score = -4.10
  ============================================================
  Average protein values ("Z-score for all contacts") can be read as follows:
    -5.0  Guaranteed wrong structure.
          Bad structure or poor model
    -3.0  Probably bad structure or unrefined model.
          Doubtful structure or model
    -2.0  Structure OK or good model.
          Good structures
     0.0  Good structures.
     2.0  Good structures.
          Unusually Good structures
     4.0  Probably a strange model of a perfect helix
  Verdict: Bad model

T0133 structure - Fine packing quality
  ---Residue-----    State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
  -------------------------------------------------------------------------
   18 ILE (  33 ) A    2     0.781   1.018  -0.116   0.661  -0.291
   19 SER (  34 ) A    2     1.435   1.467   0.077   2.284   0.134
  ..
  281 ALA ( 296 ) A    2    -2.272  -2.504  -0.404   0.000   0.000
  282 GLU ( 297 ) A    2    -0.778  -1.601  -1.256   0.137   1.471
  283 HIS ( 298 ) A    3    -0.836  -0.801  -0.948  -1.094   0.351
  ============================================================
  All contacts   : Average =  0.001  Z-score = -0.04
  BB-BB contacts : Average = -0.040  Z-score = -0.40
  BB-SC contacts : Average =  0.139  Z-score =  0.90
  SC-BB contacts : Average = -0.196  Z-score = -1.23
  SC-SC contacts : Average = -0.024  Z-score =  0.02
  ============================================================
  Average protein values ("Z-score for all contacts") can be read as follows:
    -5.0  Guaranteed wrong structure.
          Bad structure or poor model
    -3.0  Probably bad structure or unrefined model.
          Doubtful structure or model
    -2.0  Structure OK or good model.
          Good structures
     0.0  Good structures.
     2.0  Good structures.
          Unusually Good structures
     4.0  Probably a strange model of a perfect helix
  Verdict: Good model
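The reading scale printed with the What-if reports can be turned into a small lookup. The sketch below is one possible reading of that scale, assuming each description applies to the band between two listed values. The exact band boundaries and the function name are assumptions, not What-if behavior.

```python
def interpret_packing_z(z):
    """Map an all-contacts Z-score to a band of the reading scale.

    Band edges follow the listed values (-5, -3, -2, 2, 4); treating them
    as half-open intervals is an assumption made for this sketch.
    """
    if z < -5.0:
        return "Guaranteed wrong structure"
    if z < -3.0:
        return "Bad structure or poor model"
    if z < -2.0:
        return "Doubtful structure or model"
    if z < 2.0:
        return "Good structure"
    if z < 4.0:
        return "Unusually good structure"
    return "Probably a strange model of a perfect helix"

# All-contacts Z-scores from the two reports.
print(interpret_packing_z(-3.05))  # Bad structure or poor model
print(interpret_packing_z(-0.04))  # Good structure
```

Under this reading the model (Z = -3.05) and the experimental structure (Z = -0.04) fall into the bands that match the verdicts given on the two slides.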

ProsaII (Potential of Mean Force)
  Likelihood of amino acid packing
  Method developed by Manfred Sippl, 1993
  Works for Cα models
  Exposure potential (figure: exposure potential for D)
    For high quality protein structures, estimate nearest neighbor counts for all amino acids
    $E = -\log\left(p(n \mid a) / p(n)\right)$, with $n$ the neighbor count and $a$ the amino acid type
    Hydrophobic residues tend to have many neighbors (buried)
    Hydrophilic residues tend to have fewer neighbors (exposed)
  Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
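The exposure potential can be estimated directly from counts. The sketch below is a minimal illustration of E = -log(p(n|a)/p(n)), not the ProsaII implementation. The function name, the use of raw (unbinned) neighbor counts, the natural logarithm, and the toy data are all assumptions.

```python
import math
from collections import Counter

def exposure_potential(observations):
    """Return {(aa, n): E} with E = -log(p(n|a) / p(n)).

    `observations` is a list of (amino_acid, neighbor_count) pairs taken
    from reference structures. Only observed combinations get an energy.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    n_counts = Counter(n for _, n in observations)
    total = len(observations)
    energies = {}
    for (aa, n), count in pair_counts.items():
        p_n_given_a = count / aa_counts[aa]  # p(n|a)
        p_n = n_counts[n] / total            # p(n)
        energies[(aa, n)] = -math.log(p_n_given_a / p_n)
    return energies

# Toy data (invented): leucine mostly buried, aspartate mostly exposed.
toy = [("L", 12)] * 3 + [("L", 4)] + [("D", 12)] + [("D", 4)] * 3
energies = exposure_potential(toy)
print(energies[("L", 12)] < 0 < energies[("L", 4)])  # True
```

A negative energy marks an environment that is more common for that amino acid than for residues in general, so a buried leucine scores favorably and a buried aspartate does not.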

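ProsaII (Potential of Mean Force)
  Likelihood of amino acid packing
  Pair potential (figure: pair potential for D, E at s=3)
    $E = -\log\left(p(r \mid a, b, s) / p(r \mid s)\right)$
    a, b: amino acid types; s: separation along the sequence; r: distance between a and b
  Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.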
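The pair potential follows the same counting pattern, with distances binned. The sketch below illustrates E = -log(p(r|a,b,s)/p(r|s)) under stated assumptions: the function name, the 1.0 bin width, treating (a, b) as an ordered pair, and the toy distances are all invented for the example and are not ProsaII's actual parameters.

```python
import math
from collections import Counter

def pair_potential(observations, bin_width=1.0):
    """Return {(a, b, s, r_bin): E} with E = -log(p(r|a,b,s) / p(r|s)).

    `observations` holds (aa_a, aa_b, separation, distance) tuples.
    (a, b) is treated as an ordered pair in this sketch.
    """
    specific = Counter()         # counts of (a, b, s, r_bin)
    specific_total = Counter()   # counts of (a, b, s)
    reference = Counter()        # counts of (s, r_bin) over all pairs
    reference_total = Counter()  # counts of s over all pairs
    for a, b, s, r in observations:
        r_bin = int(r // bin_width)
        specific[(a, b, s, r_bin)] += 1
        specific_total[(a, b, s)] += 1
        reference[(s, r_bin)] += 1
        reference_total[s] += 1
    energies = {}
    for (a, b, s, r_bin), count in specific.items():
        p_specific = count / specific_total[(a, b, s)]
        p_reference = reference[(s, r_bin)] / reference_total[s]
        energies[(a, b, s, r_bin)] = -math.log(p_specific / p_reference)
    return energies

# Toy data (invented): D-E pairs at s=3 are mostly close, L-L pairs mostly far.
toy_pairs = [("D", "E", 3, 5.2), ("D", "E", 3, 5.5), ("D", "E", 3, 5.8),
             ("D", "E", 3, 9.1), ("L", "L", 3, 5.4), ("L", "L", 3, 9.3),
             ("L", "L", 3, 9.6), ("L", "L", 3, 9.9)]
pair_energies = pair_potential(toy_pairs)
print(pair_energies[("D", "E", 3, 5)] < 0)  # True
```

Summing such energies over all residue pairs of a model gives a score that is low when the model places residue types at distances typical of real structures.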

Verify 3D (Eisenberg et al. 1997)
  Closely related to ProsaII exposure potential
  How well does an amino acid fit its local environment (hydrophobic/hydrophilic)?
  Example: T0133, Casp5 target, modeled by X3M (Lund, O., 2002), RMSD=7.3
  Figure: crystal structure (red), model (blue)

Model T0133. Verify 3D
  Sequence has poor match to structure

ProQ. Prediction of model accuracy
  Neural network to identify correct protein models
  B. Wallner and Arne Elofsson, 2003
  http://www.sbc.su.se/~bjorn/proq
  Input: a pdb structure/model
  Output: accuracy measures
    LGscore
    MaxSub score

ProQ. Input to neural net
  Atom-atom contacts (C, N, O)
    How often is C in contact with N?
  Residue-residue contacts
    How often is E in contact with D?
  Solvent accessibility surface
    Average exposure of L's
  Secondary structure prediction
    How consistent is prediction with model?

Casp model T0113

Structure 1RIP

LifeBench data
  11000 models, 220 targets
  Modeled by Pcons
  Incorrect model: LGscore < 1.5, MaxSub < 0.1

Conclusions
  Correct protein models cannot reliably be identified!!
    The protein fold, on the other hand, can!
  Many methods from the protein crystallography world are useful to identify wrong models
    Bad models can pass all filters
  ProQ is a first attempt at an accuracy prediction server
    Can integrate information from many sources
    The future will show if this approach can provide reliable prediction of model accuracy