Molecular Biology Course 2009
Macromolecular Crystallography Part II
Tim Grüne, University of Göttingen, Dept. of Structural Chemistry
November 2009
http://shelx.uni-ac.gwdg.de
tg@shelx.uni-ac.gwdg.de
From Experiment to Model
So Far...
The Model
The electron density map is the actual result of the X-ray experiment. On its own it is of little use because it is difficult to interpret.
The model consists of:
- atom positions
- atom types
- relationships between atoms: secondary structure, domains, etc.
Biologically and chemically, the model is the final goal of crystallography.
Visualising a Model
[Figure: six representations of the same model: ball and stick; CPK (space filling); Cα trace (smooth); Cα trace (coloured by B-factor); ball-and-stick (coloured by B-factor); ribbons]
Storing Structural Data: the PDB-File
Macromolecular structural data from crystallography or NMR are stored at the Protein Data Bank (PDB, www.pdb.org) or the Nucleic Acid Database (NDB, ndbserver.rutgers.edu). The data are stored as PDB-files. Access to the PDB is free.
Small molecule data are stored in the Cambridge Structural Database (CSD), a commercial product for which a license must be obtained. Small molecule data are stored in CIF format (which we will not discuss).
The PDB File: an Example
HEADER    LIGASE                                  28-APR-99   1CLI
TITLE     X-RAY CRYSTAL STRUCTURE OF AMINOIMIDAZOLE RIBONUCLEOTIDE
AUTHOR    C.LI,T.J.KAPPOCK,J.STUBBE,T.M.WEAVER,S.E.EALICK
REMARK   2 RESOLUTION. 2.50 ANGSTROMS.
...
CRYST1   71.170  211.680   94.450  90.00  90.00  90.00 P 21 21 21   16
...
ATOM      1  N   THR A   5      15.163  80.897  61.279  1.00 20.99           N
ATOM      2  CA  THR A   5      15.093  82.326  61.723  1.00 22.09           C
ATOM      3  C   THR A   5      16.450  83.017  61.598  1.00 21.68           C
...
Storing Structural Data: the PDB-File
A PDB-file is a simple text file. It contains a header with supplemental information (authors, compound, publication, etc.), the crystallographic space group, and the unit cell dimensions.
The main part of the file consists of ATOM entries, one per line. An ATOM entry contains the atom name, the atom type (element), the residue the atom belongs to, the coordinates, the occupancy, and the B-factor.
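The fixed-column layout of an ATOM entry can be illustrated with a minimal Python sketch. The column ranges follow the published PDB format; the function names `format_atom_line` and `parse_atom_line` and the dictionary keys are illustrative choices, not part of any library. The sample line is rebuilt from the first atom of the 1CLI excerpt, since the exact spacing matters.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b, element):
    """Build a fixed-column ATOM line (illustrative helper, not a full PDB writer)."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
    ).format("ATOM", serial, name, "", res_name, chain, res_seq, "",
             x, y, z, occ, b, element)


def parse_atom_line(line):
    """Read the fields of an ATOM line by column position (0-based slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# First atom of the 1CLI example: backbone nitrogen of Thr 5 in chain A.
line = format_atom_line(1, " N", "THR", "A", 5, 15.163, 80.897, 61.279, 1.00, 20.99, "N")
atom = parse_atom_line(line)
print(atom)
```

Splitting on whitespace would be shorter, but it can fail on real files in which neighbouring fields touch, which is why the sketch slices by column.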
Occupancy: An Example of Multiple Conformations
Initially the model contained only one position for the tyrosine. However, the electron density map suggests that in about half of the molecules in the crystal the tyrosine side chain points in a different direction. This can be modelled by setting the occupancies of both side-chain orientations to 0.5.
Temperature Factor of an Atom: the B-factor
Even though data are usually collected at 100 K, atoms are not immobile but vibrate (thermal motion).
The isotropic temperature factor (or B-factor) describes the vibration as a sphere within which the atom oscillates. This is quite a coarse assumption.
At high resolution (better than about 1.6 Å), when enough data are available, the vibrations in each of the three directions can be described separately. In that case, 6 parameters are necessary to describe the thermal motion. They are called Anisotropic Displacement Parameters (ADPs).
A low B-factor indicates a rigid, stable region, while a high B-factor indicates flexibility (e.g. in loops).
It will be explained later why ADPs cannot always be used and the less accurate isotropic B-factor must be used instead.
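To give a feeling for the numbers, the isotropic B-factor can be converted into a displacement using the standard relation B = 8π²⟨u²⟩, where ⟨u²⟩ is the mean-square displacement of the atom. The relation is not stated on the slide and is added here as background; the function name is an illustrative choice.

```python
import math


def rms_displacement(b_factor):
    """Root-mean-square displacement in Å from an isotropic B-factor in Å^2.

    Uses B = 8 * pi^2 * <u^2>, the usual definition of the isotropic B-factor.
    """
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


# B-factor of the first atom in the 1CLI example (20.99 Å^2).
u = rms_displacement(20.99)
print(f"{u:.2f} Å")
```

A B-factor around 20 Å² therefore corresponds to a displacement of roughly half an ångström.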
Illustration of the B-factor
[Figure: the same model drawn with isotropic and with anisotropic B-factors]
- Isotropic B-factors: spherical movement of atoms
- Anisotropic B-factors: ellipsoidal movement of atoms (more exact)
Data Reliability: The Data to Parameter Ratio
Reliability of Data: The Data to Parameter Ratio
Measurements are inexact and only approximations. The more often a value is measured, the more trustworthy it becomes: the error estimate becomes better.
In macromolecular crystallography we want to determine at least the coordinates of every atom of the structure, i.e., three parameters (x, y, z) per atom, which requires at least three data points per atom.
The more data are collected for a fixed number of parameters, the more reliable the model can be. We aim for a high data to parameter ratio.
Data to Parameter Ratio: Example Estimates

Resolution [Å]   Refined parameters (a)              Data/parameter ratio
3.0              x,y,z                               0.9:1
2.3              x,y,z; B                            1.5:1
1.8              x,y,z; B                            3.1:1
1.5              x,y,z; B                            5.4:1
1.5              x,y,z; U11 U12 U13 U23 U22 U33      2.4:1
1.1              x,y,z; U11 U12 U13 U23 U22 U33      6.1:1
0.8              x,y,z; U11 U12 U13 U23 U22 U33      16:1

(a) x,y,z: coordinates; B: isotropic B-value; Uij: anisotropic B-values. Source: G. Sheldrick.

In practice, at resolutions worse than about 1.8 Å there are not enough data points to create a reliable model from the diffraction data alone. The data to parameter ratio can be improved by additional (bio)chemical information.
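The arithmetic behind such a table can be sketched in a few lines of Python. The atom and reflection counts below are hypothetical, chosen only so that the result reproduces the two 1.5 Å rows; they are not taken from a real data set.

```python
def data_to_parameter_ratio(n_reflections, n_atoms, params_per_atom):
    """Number of unique reflections divided by the number of refined parameters."""
    return n_reflections / (n_atoms * params_per_atom)


# Hypothetical structure: 1000 atoms, 21600 unique reflections at 1.5 Å.
n_atoms = 1000
n_reflections = 21600

isotropic = data_to_parameter_ratio(n_reflections, n_atoms, 4)    # x, y, z, B
anisotropic = data_to_parameter_ratio(n_reflections, n_atoms, 9)  # x, y, z, six Uij

print(f"isotropic   {isotropic:.1f}:1")
print(f"anisotropic {anisotropic:.1f}:1")
```

Switching from 4 to 9 parameters per atom lowers the ratio by the factor 4/9 for the same data, which is why anisotropic refinement is reserved for high resolution.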
An Example: Data to Parameter Ratio (1/7)
Scenario: measure data along a graph.
- Experiment 1: high resolution, 21 data points with errors
- Experiment 2: low resolution, 3 data points with errors
[Figure: two plots for x from -2 to 3, showing the 21 measurements and the 3 measurements, each together with the ideal curve f(x) = x²]
An Example: Data to Parameter Ratio (2/7)
Two Models
Model 1: $g(x) = g_2 x^2 + g_1 x + g_0$
Model 2: $h(x) = h_3 x^3 + h_1 x + h_0$
Both models contain three parameters, i.e., at least three data points are required to determine them unambiguously.
An Example: Data to Parameter Ratio (3/7)
Fitting High Resolution Data
[Figure: the 21 data points with the fitted x² model and x³ model]
x² model: $g(x) = 1.19x^2 + 0.00x - 0.51$, $\chi^2 = 1.14 \approx 1$ (good)
x³ model: $h(x) = 0.16x^3 + 0.52x + 0.47$, $\chi^2 = 22.4 \gg 1$ (bad)
An Example: Data to Parameter Ratio (4/7)
Remarks on χ²
- χ² is a common error estimator in statistics.
- χ² (normalised by the number of degrees of freedom) should be close to 1 for a good model.
- χ² makes a clear distinction between the two models.
- The reliability of χ² depends on a good estimate of the errors of the data points.
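The comparison of the two models can be reproduced with a short Python sketch. It is not the calculation behind the slides: the 21 data points are generated here from f(x) = x² with a deterministic sine term standing in for measurement errors, the error estimate `sigma` is an assumed value, and the fit uses plain normal equations, so the numbers will differ from those quoted above.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(basis, xs, ys):
    """Linear least-squares fit via the normal equations."""
    ata = [[sum(fi(x) * fj(x) for x in xs) for fj in basis] for fi in basis]
    aty = [sum(fi(x) * y for x, y in zip(xs, ys)) for fi in basis]
    return solve(ata, aty)


def reduced_chi2(basis, coeffs, xs, ys, sigma):
    """Sum of squared, error-weighted residuals per degree of freedom."""
    resid = [y - sum(c * f(x) for c, f in zip(coeffs, basis)) for x, y in zip(xs, ys)]
    dof = len(xs) - len(coeffs)
    return sum((r / sigma) ** 2 for r in resid) / dof


sigma = 0.5  # assumed error of each data point
xs = [-2.0 + 0.25 * i for i in range(21)]
ys = [x * x + sigma * math.sin(7.0 * i) for i, x in enumerate(xs)]  # stand-in errors

model_g = [lambda x: x * x, lambda x: x, lambda x: 1.0]   # g2 x^2 + g1 x + g0
model_h = [lambda x: x ** 3, lambda x: x, lambda x: 1.0]  # h3 x^3 + h1 x + h0

chi2_g = reduced_chi2(model_g, fit(model_g, xs, ys), xs, ys, sigma)
chi2_h = reduced_chi2(model_h, fit(model_h, xs, ys), xs, ys, sigma)
print(f"x^2 model: chi2 = {chi2_g:.2f}")
print(f"x^3 model: chi2 = {chi2_h:.2f}")
```

With 21 data points and 3 parameters there are 18 degrees of freedom left, which is what allows χ² to separate the two models.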
An Example: Data to Parameter Ratio (5/7)
Fitting Low Resolution Data
[Figure: the 3 data points with the fitted x² model and x³ model]
x² model: $g(x) = 0.72x^2 + 0.00x + 1.17$
x³ model: $h(x) = 0.48x^3 - 2.66x + 2.62$
An Example: Data to Parameter Ratio (6/7)
Problems with Fitting Low Resolution Data
- Both models fit the data perfectly.
- There are no error estimates because #data = #parameters.
- Additional knowledge is required to decide which model is correct.
An Example: Data to Parameter Ratio (7/7)
Fitting Low Resolution Data with Constraints
Assumed constraint: the data pass through (0, 0), so the constant terms $g_0$ and $h_0$ vanish.
Model 1: $g(x) = g_2 x^2 + g_1 x$
Model 2: $h(x) = h_3 x^3 + h_1 x$
[Figure: the 3 data points with the constrained x² model and the constrained x³ model]
x² model: $g(x) = 0.94x^2 - 0.15x$, $\chi^2 = 1.30$
x³ model: $h(x) = 0.83x^3 - 5.35x$, $\chi^2 = 14.4$
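The low-resolution situation can also be sketched in Python. The three points below are hypothetical, error-free samples of f(x) = x², not the data from the slides. With three points and three parameters both models pass exactly through the data; only the extra knowledge that the curve passes through (0, 0) tells them apart.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# Hypothetical low-resolution experiment: three exact samples of f(x) = x^2.
points = [(-1.0, 1.0), (1.0, 1.0), (2.0, 4.0)]
ys = [y for _, y in points]

g = solve([[x * x, x, 1.0] for x, _ in points], ys)   # g2, g1, g0
h = solve([[x ** 3, x, 1.0] for x, _ in points], ys)  # h3, h1, h0


def g_model(x):
    return g[0] * x * x + g[1] * x + g[2]


def h_model(x):
    return h[0] * x ** 3 + h[1] * x + h[2]


# Both models reproduce the three data points (residuals are numerically zero).
max_resid_g = max(abs(g_model(x) - y) for x, y in points)
max_resid_h = max(abs(h_model(x) - y) for x, y in points)

# Additional knowledge: the curve should pass through (0, 0).
print("g(0) =", g_model(0.0), " h(0) =", h_model(0.0))
```

Here the constraint is used as a check after fitting; on the slide it is built into the models by dropping the constant term, which leaves two parameters for three data points.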
Crystallographic Model Building
Model Building: Getting Started
The first steps in building the model consist of finding larger groups of residues with special features. In proteins this is the (Cα) main chain, in nucleic acids the position of the bases.
The secondary structure elements of proteins are good starting points. α helices are particularly easy to locate, even at medium to low resolution (2.5–4 Å).
Directionality of α Helices
From the main chain (Cα chain) alone one can determine neither the direction of the chain nor which part of the sequence it covers.
The so-called Christmas tree helps: the side chains of an α helix point towards the N-terminal end of the protein chain.
β Strands
The other secondary structure element of proteins, β strands, are also striking but more difficult to build. In particular, the direction of the peptide chain can be difficult to determine.
Sequence Docking
The model of the secondary structure is basically a poly-alanine model with no sequence information.
Selenomethionine-substituted proteins have become very popular for MAD experiments. The heavy selenium atoms are easy to find in the electron density map and help to dock the sequence to the map.
Disulphide bridges or metals bound to an active centre can also be helpful.
Automated Model Building
Until a couple of years ago, a crystallographer had to place every residue by hand.
At resolutions better than, say, 2.5 Å, building is greatly facilitated by programs like ARP/wARP (A. Perrakis, V. Lamzin), Buccaneer (K. Cowtan), or Resolve (T. Terwilliger), which automatically build large parts of the structure in a couple of hours.
Manual Model Building
Computer programs do not know about biology, and certainly not about a specific molecule or structure. Human interaction is therefore required to pay attention to:
- presence and identification of ligands and/or metal ions (from crystallisation or protein preparation)
- special interactions in complexes
- exceptions from the standard values used in refinement
- correct placement of solvent (water) molecules
- interpretation of the model
Hydrogen Atoms?
X-rays interact with the electron shell of atoms. The strength of the interaction is proportional to the total number of electrons.
Hydrogen atoms have only one electron. They cannot be detected by X-ray diffraction (except with very high resolution data, better than 1 Å).
During refinement, hydrogens are treated as riding atoms, that is, they are kept in a fixed position relative to the atoms they are bound to (for example the carbons of a phenylalanine ring).
Compared with ignoring hydrogens completely, this method improves the quality of the model and also helps to maintain correct distances to neighbouring groups. Because of their fixed positions, riding atoms do not increase the number of parameters.
Empty Space? The Solvent Region
[Figure, left: arrangement of molecules in the unit cell. Right: electron density map]
The holes in both pictures are not vacuum. They are filled with solvent, i.e., mostly water molecules. These are disordered, so one does not see explicit density in these parts of the crystal. Nevertheless, they contribute (a little) to the diffraction pattern at low resolution.
The treatment of the solvent region in crystallography leaves room for improvement.
Model Refinement
Refinement & Building
Model building describes the construction of the model: the addition and deletion of atoms and ligands. It is mostly done by the crystallographer in front of a computer screen.
Model refinement describes the improvement of that model to better match the experimental data ($|F_\mathrm{meas}(hkl)|$). It is mostly done by computer programs.
The program tries small changes of the coordinates and modifications of the temperature factors to minimise the difference between calculated and measured amplitudes.
Excursus: Crystallographic Theory
Given the structure factors $F_\mathrm{meas}(hkl) = |F_\mathrm{meas}(hkl)|\, e^{i\phi(hkl)}$, the electron density at position $(x, y, z)$ is given by the Fourier transformation
$$\rho(x, y, z) = \frac{1}{V_\mathrm{unit\ cell}} \sum_{h,k,l} |F_\mathrm{meas}(hkl)|\, e^{i\phi(hkl)}\, e^{-2\pi i(hx+ky+lz)}$$
Once a model with atom coordinates $(x_j, y_j, z_j)$ is known, the structure factors can be calculated from the spherical atomic scattering factors $f_j$ by
$$F_\mathrm{calc}(h, k, l) = \sum_j f_j\, e^{2\pi i(hx_j+ky_j+lz_j)} \qquad (1)$$
The spherical atomic scattering factors $f_j$ can be calculated from per-atom properties. They are also tabulated (e.g. in the International Tables for Crystallography, Volume C). They include the effect of the temperature factor.
Excursus: Crystallographic Theory
There are two sources for the intensities $I(hkl)$:
- $I_\mathrm{meas}(hkl) = |F_\mathrm{meas}(hkl)|^2$, measured in the X-ray experiment
- $I_\mathrm{calc}(hkl) = |F_\mathrm{calc}(hkl)|^2$, calculated from the model coordinates
Model refinement minimises the difference between calculated and measured structure factor amplitudes (e.g. with least-squares methods).
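Equation (1) and the calculated intensity can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not part of the lecture: coordinates are fractional, and each scattering factor is a constant equal to the number of electrons (6 for carbon), which ignores its dependence on scattering angle and on the temperature factor. The two-atom "structure" is hypothetical.

```python
import cmath
import math


def structure_factor(hkl, atoms):
    """F_calc(hkl) = sum_j f_j * exp(2 pi i (h x_j + k y_j + l z_j)).

    atoms is a list of (f_j, (x_j, y_j, z_j)) with fractional coordinates.
    """
    h, k, l = hkl
    return sum(
        f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
        for f, (x, y, z) in atoms
    )


# Hypothetical cell with two carbon-like atoms half a cell edge apart along a.
atoms = [(6.0, (0.0, 0.0, 0.0)), (6.0, (0.5, 0.0, 0.0))]

f100 = structure_factor((1, 0, 0), atoms)  # the two waves cancel
f200 = structure_factor((2, 0, 0), atoms)  # the two waves add up
i200 = abs(f200) ** 2                      # I_calc = |F_calc|^2

print(abs(f100), abs(f200), i200)
```

The example shows why intensities carry structural information: moving one atom changes which reflections are strong and which are weak.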
Initial Map Generation
[Figure: amplitudes |F(hkl)| and phases φ(hkl) are combined into the initial map, from which the initial model is built]
For the first map, the phases are determined with MAD, SIR, Molecular Replacement, etc. These phases are generally of low quality, i.e., they have large errors compared to the true values.
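[Figure: the building and refinement cycle. The phases φ from the current model and the measured amplitudes |F| are used to calculate a map; the model is built or matched to the map, making it better with respect to the map; a refinement program (which checks chemical correctness) then fits the model to the data and yields a new model.]
The model is created or modified based on the map. The map is calculated using the phases from the model. Therefore, the new model is biased towards the old model: errors may persist.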
Model Building and Refinement (1/2)
Creating a model from X-ray data is an iterative process consisting of model building and refinement.
Refinement: global improvement of the model with respect to the experimental data. The coordinates of all atoms, together with their temperature factors (and sometimes, at very high resolution, even their occupancies), are adjusted in order to minimise the difference between the measured intensities and the ones calculated from the model.
Model Building and Refinement (2/2)
Model building: local improvement of the model with respect to the experimental data. Atoms are added, removed, or moved in order to ensure that
1. the model makes sense biochemically (proximity of atoms, H-bonding, position of solvent molecules, etc.)
2. the model fits the calculated electron density (e.g. check for multiple conformations)
Restraints and Constraints
At resolutions worse than, say, 1.5 Å, the reflection data alone are not sufficient to create a trustworthy model: there are too many parameters. It is therefore necessary to incorporate additional information.
There are two types of auxiliary information: restraints and constraints.
Restraints and Constraints
Constraints reduce the number of parameters. They are expressions like "property X must have value Y", e.g.: the temperature factor is isotropic instead of anisotropic, which gives 4 parameters per atom instead of 9.
Restraints increase the number of data. They are "should be" or "should be approximately" expressions, e.g. distance(N–Cα) ≈ 1.458 Å.
Restraints used in refinement encompass bond lengths and bond angles. They are important for macromolecular crystallography: solving a structure without them would be impossible.
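One common way to see why restraints act as additional data is to write the quantity minimised in a least-squares refinement as a sum over reflections plus a sum over restraints. The expression below is a generic sketch, not the exact target function of any particular program; the weights $w_{hkl}$ and $w_r$, the restraint targets $d_r^\mathrm{target}$, and the current model values $d_r$ are notation introduced here for illustration.

```latex
M = \sum_{hkl} w_{hkl} \bigl( |F_\mathrm{meas}(hkl)| - |F_\mathrm{calc}(hkl)| \bigr)^2
  + \sum_{r} w_r \bigl( d_r^\mathrm{target} - d_r \bigr)^2
```

Each restraint, such as the N–Cα distance of 1.458 Å, contributes one extra term and thus one extra observation. A constraint, in contrast, removes a parameter from $F_\mathrm{calc}$ altogether.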
Traps: Local Minima
[Figure: a phenylalanine side chain trapped in a wrong position, separated from the correct position by a barrier]
Refinement programs cannot cross this barrier: they get stuck in the local minimum and cannot move the phenylalanine into the right position.
These local minima, together with the vicious circle of model bias, make validation of the model necessary.
Refinement: R and Rfree
The R Value
The difference between calculated and measured amplitudes is expressed by the so-called R value:
$$R = \frac{\sum_{hkl} \bigl|\, |F_\mathrm{meas}| - |F_\mathrm{calc}| \,\bigr|}{\sum_{hkl} |F_\mathrm{meas}|}$$
For small molecules, R values between 2% and 5% are normal; for macromolecules, the range is approximately 10%–30%.
As a rule of thumb, the R value should be about 1/10 of the resolution in Å: a 2.5 Å structure should have an R value of about 0.25 = 25%.
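The definition of the R value translates directly into code. The sketch below uses three hypothetical amplitudes; real data sets contain thousands of reflections, and refinement programs usually also apply a scale factor between measured and calculated amplitudes, which is omitted here.

```python
def r_value(f_meas, f_calc):
    """R = sum(| |F_meas| - |F_calc| |) / sum(|F_meas|) over all reflections."""
    numerator = sum(abs(abs(m) - abs(c)) for m, c in zip(f_meas, f_calc))
    denominator = sum(abs(m) for m in f_meas)
    return numerator / denominator


# Hypothetical amplitudes for three reflections.
f_meas = [10.0, 20.0, 30.0]
f_calc = [9.0, 22.0, 30.0]

r = r_value(f_meas, f_calc)
print(f"R = {r:.3f}")  # (1 + 2 + 0) / 60 = 0.050
```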
Refinement and Overfitting
For macromolecules, the data to parameter ratio is not very high in the normal resolution range. Therefore, the R value can be reduced nearly arbitrarily by adding more and more atoms that are not really present in the crystal structure, or by allowing positions that do not make much sense chemically (stereochemical clashes).
This is called overfitting the data.
Quality Measures (2): The Rfree Value
One measure to reduce overfitting is the Rfree value. About 5%–10% of the reflections are excluded from both refinement and model building. They remain unconsidered and act like an independent judge: after refinement, the Rfree value is calculated like the R value, but with the excluded reflections only.
The two values should not differ too much (model errors) but should also not be too close (model bias).
The underlying idea, cross-validation, is common in statistics, but was introduced to crystallography only in the early 1990s by Axel Brünger.
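The bookkeeping behind Rfree can be sketched as follows. This is an illustration only: the reflection count, the 5% fraction, the random seed, and the amplitudes are hypothetical, and in a real refinement the calculated amplitudes come from a model refined against the working set alone, so R and Rfree will differ.

```python
import random


def split_reflections(n_reflections, free_fraction=0.05, seed=0):
    """Randomly set aside a fraction of the reflection indices as the free set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = max(1, round(n_reflections * free_fraction))
    return sorted(indices[n_free:]), sorted(indices[:n_free])


def r_value(f_meas, f_calc, subset):
    """R value restricted to the reflections listed in subset."""
    numerator = sum(abs(abs(f_meas[i]) - abs(f_calc[i])) for i in subset)
    denominator = sum(abs(f_meas[i]) for i in subset)
    return numerator / denominator


# Hypothetical amplitudes; the "model" overestimates every amplitude by 10%.
f_meas = [10.0 + (i % 7) for i in range(200)]
f_calc = [1.1 * f for f in f_meas]

work, free = split_reflections(len(f_meas))
r_work = r_value(f_meas, f_calc, work)
r_free = r_value(f_meas, f_calc, free)
print(len(work), len(free), round(r_work, 3), round(r_free, 3))
```

The essential point is that the free set is chosen once, before refinement starts, and is never used to adjust the model.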
Structure Validation
Why Validation?
- Experimental data are never free of errors.
- Scientists are never free of prejudice.
Compared to other technical or physical disciplines, the errors in X-ray experiments are huge. It is easy to create erroneous models unintentionally.
The results, structural models, are often used by non-crystallographers. They must be able to check the quality without knowing too much about crystallography.
Photoactive Yellow Protein: 1989 and 1995
[Figure: the two published models side by side. Source: Kleywegt, Acta D (2000), D56]
- 1989: this model was published in 1989 (PDB entry 1phy).
- 1995: the correct version, published six years later (PDB entry 2phy).
NB: The first structure was published before Rfree and other means of validation came into use. Nowadays such coarse misinterpretations are very unlikely.
Caveat: Modelling Models
The structure of the TATA-box binding protein (TBP or TFIIDτ) was published in 1992 (Nikolov et al., Nature 360, pp. 40–46). The shape of the molecule suggested that the TATA box sits straight in the groove of the protein.
The structure of the complex, published a year later by Kim et al. (Nature 365, pp. 520–527), revealed that the DNA is actually heavily bent.
Caveat: What You See is What You Get?
Another issue with PDB files is that they contain more information than a graphical viewer might be able to display.
Many crystallographers include atoms or residues in their structures without experimental support and set their occupancy to zero. While this makes sense chemically, the procedure is error-prone for users of the structure.
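A user of a PDB file can check for such atoms with a few lines of Python. The sketch reads the occupancy from columns 55–60 of each ATOM or HETATM line; the function names and the three sample atoms are hypothetical and serve only as a demonstration.

```python
def make_atom_line(serial, occupancy):
    """Build a hypothetical fixed-column ATOM line for a Cα atom of an alanine."""
    return "ATOM  {:>5}  CA  ALA A{:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
        serial, serial, 0.0, 0.0, 0.0, occupancy, 20.0
    )


def zero_occupancy_atoms(lines):
    """Return the serial numbers of all atoms whose occupancy is exactly zero."""
    return [
        int(line[6:11])
        for line in lines
        if line.startswith(("ATOM", "HETATM")) and float(line[54:60]) == 0.0
    ]


# Three made-up atoms: fully occupied, zero occupancy, half occupied.
lines = [make_atom_line(1, 1.0), make_atom_line(2, 0.0), make_atom_line(3, 0.5)]
flagged = zero_occupancy_atoms(lines)
print(flagged)
```

Atoms flagged this way are present in the file but have no support in the electron density, so distances or surfaces computed from them should be treated with caution.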
How to Validate
Validation means assessing the model in comparison with the data. However, since the model was created by refinement against the data, the model is biased. Therefore, independent criteria are needed.
Any information can be used
1. that did not participate in the creation of the model or in the minimisation of the difference between model and data, and
2. for which ideal or average values are known. This means that the information must be the same or similar for all proteins.
The Real Space R Factor
R and Rfree are global figures of merit: one number describes the quality of the whole structure.
A local figure of merit is the real space R factor, or the real space correlation coefficient, between model and electron density map. It expresses how well the model fits the electron density.
Dihedral Angles: the Ramachandran Plot
The Ramachandran plot is probably the most famous validation tool. It is based on the two dihedral angles φ and ψ.
φ is the angle between the two planes defined by $C_{i-1}$–$N_i$–$C^\alpha_i$ and $N_i$–$C^\alpha_i$–$C_i$.
ψ is the angle between the two planes defined by $N_i$–$C^\alpha_i$–$C_i$ and $C^\alpha_i$–$C_i$–$N_{i+1}$.
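A dihedral angle can be computed from the coordinates of four consecutive atoms. The following Python sketch uses a common vector formulation; the helper names and the test coordinates are illustrative, and the sign convention (positive for a clockwise rotation when looking along the central bond) is stated as an assumption rather than taken from the lecture.

```python
import math


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Angle in degrees between the planes (p0, p1, p2) and (p1, p2, p3)."""
    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = sub(b0, tuple(dot(b0, b1) * c for c in b1))
    w = sub(b2, tuple(dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


# Simple test geometry with the central bond along z.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

For a residue i, φ would be obtained as `dihedral(C_prev, N, CA, C)` and ψ as `dihedral(N, CA, C, N_next)`, with the four arguments being the coordinates of the named atoms.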
The Ramachandran Plot
The Ramachandran plot shows the φ vs. ψ angles for a structure, together with the most probable regions derived from the 500 best determined protein structures.
[Figure: interactive Ramachandran window of the model building program Coot, with the regions for β strand, α helix, and left-handed α helix marked. Everything outside the shaded regions is an outlier.]
The Kleywegt Plot
Even more information can be read from the Ramachandran plot if there is more than one copy of a molecule: the two (or more) copies should be rather similar to each other.
If one draws the Ramachandran plots of all copies into the same diagram and connects corresponding residues, one should NOT obtain a picture like the one shown here.
[Figure: Kleywegt plot with long connecting lines between corresponding residues. Source: Kleywegt, Acta D (2000), D56]
Validation Tools, also for Non-crystallographers
Various programs are available to check the quality of a PDB-file, e.g.
- WhatIF
- SFcheck
- ProCheck
- MolProbity
The MolProbity program is available online at http://molprobity.biochem.duke.edu
One can upload a PDB-file or enter a PDB ID code and obtain various plots. It even checks the flip states of Asn, Gln, and His residues based on possible hydrogen bonds.
Validation: Summary
Most of the pretty pictures of proteins represent structures determined by X-ray diffraction. But do not be deceived by colours and artistic compositions.
Everyone who makes use of PDB files or structural data should be aware of possible pitfalls.
1. Read the header information.
2. Consider the resolution and data quality.
3. Do the quality and resolution allow for the details you want to extract?
4. Make use of programs that examine the structure and (if available/possible) the data.