Molecular Biology Course 2006 Protein Crystallography Part II

Tim Grüne, University of Göttingen, Dept. of Structural Chemistry, December 2006
http://shelx.uni-ac.gwdg.de
tg@shelx.uni-ac.gwdg.de


Reminder
A crystallographic experiment provides data that are hard to interpret directly, namely a list of reflections (hkl) and their intensities $I(hkl) = |F(hkl)|^2$. After some extra effort one also obtains the (initial, approximate) phases $\varphi(hkl)$ of the structure factors $F(hkl)$. A Fourier transformation turns these into something easier to picture, the electron density map $\rho(x, y, z)$:

$$\rho(x, y, z) = \frac{1}{V_{\text{unit cell}}} \sum_{h,k,l} |F(hkl)|\, e^{i\varphi(hkl)}\, e^{-2\pi i(hx+ky+lz)}$$
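
As a hedged illustration of this Fourier synthesis, the following Python sketch works in one dimension with a hypothetical toy cell. The two point-like atoms, their positions and weights, the cell length `V = 1` and the limited range of `h` are all assumptions made for the example; it also assumes the sign convention with a negative exponent in the synthesis.

```python
import numpy as np

# Hypothetical 1D toy "crystal": fractional positions and electron counts
# (illustrative values, not taken from the lecture).
atom_x = np.array([0.2, 0.7])
atom_f = np.array([6.0, 8.0])
V = 1.0                    # unit cell "volume" (a length in 1D), assumed
h = np.arange(-10, 11)     # reflections h = -10 ... 10

# Structure factors of the toy model: F(h) = sum_j f_j exp(+2 pi i h x_j)
F = (atom_f * np.exp(2j * np.pi * np.outer(h, atom_x))).sum(axis=1)
amplitude, phase = np.abs(F), np.angle(F)

# Fourier synthesis: rho(x) = 1/V sum_h |F(h)| exp(i phi(h)) exp(-2 pi i h x)
x = np.arange(100) / 100
terms = amplitude * np.exp(1j * phase) * np.exp(-2j * np.pi * np.outer(x, h))
rho = terms.sum(axis=1) / V

peak = x[np.argmax(rho.real)]
print("highest density at x =", peak)
```

The density comes out real (up to rounding) and is highest at the position of the heavier toy atom. Because only a finite number of reflections enters the sum, the peaks are broadened and surrounded by ripples, which is one reason why real maps look messy.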

From Map to Model
An initial electron density map (and also a final one) looks quite messy and is difficult to interpret. The final coordinate model contains more useful information. The molecular model is the final target of macromolecular crystallography.

Storing Structural Data: the PDB File
The protein models that are stored e.g. in the Protein Data Bank (PDB, http://www.pdb.org) do not represent the mere experimental data. From the experiment we get diffraction intensities and, after some work, the electron density ρ within the unit cell. The model is the best match (from the author's point of view) that explains the experimental data.
A typical PDB file contains a header with supplemental information (authors, compound, publication, etc.), the crystallographic space group and the unit cell dimensions. The main part of the file consists of ATOM entries, one per line. An ATOM entry contains the atom number and atom name, the residue type, chain and residue number the atom belongs to, the coordinates, the occupancy, the B-factor, and the element.

    HEADER    LIGASE                                  28-APR-99   1CLI
    TITLE     X-RAY CRYSTAL STRUCTURE OF AMINOIMIDAZOLE RIBONUCLEOTIDE
    AUTHOR    C.LI,T.J.KAPPOCK,J.STUBBE,T.M.WEAVER,S.E.EALICK
    REMARK   2 RESOLUTION.    2.50 ANGSTROMS.
    ...
    CRYST1   71.170  211.680   94.450  90.00  90.00  90.00 P 21 21 21   16
    ...
    ATOM      1  N   THR A   5      15.163  80.897  61.279  1.00 20.99           N
    ATOM      2  CA  THR A   5      15.093  82.326  61.723  1.00 22.09           C
    ATOM      3  C   THR A   5      16.450  83.017  61.598  1.00 21.68           C
    ...
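
ATOM records are fixed-column records, so they are best read by column position rather than by splitting on whitespace. The following Python sketch shows one way to do this; the function name `parse_atom_record` and the dictionary keys are made up for this illustration, and only the fields discussed above are extracted.

```python
def parse_atom_record(line):
    """Split one fixed-column PDB ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

# First ATOM record of the 1CLI excerpt, rebuilt from explicit column widths
# so that the example does not depend on how whitespace is displayed.
fields = [
    f"{'ATOM':<6}", f"{1:>5}", " ", f"{' N':<4}", " ", f"{'THR':>3}", " ", "A",
    f"{5:>4}", " " * 4, f"{15.163:8.3f}", f"{80.897:8.3f}", f"{61.279:8.3f}",
    f"{1.00:6.2f}", f"{20.99:6.2f}", " " * 10, f"{'N':>2}",
]
record = "".join(fields)

atom = parse_atom_record(record)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"],
      atom["occupancy"], atom["b_factor"])
```

For real work an established library is preferable to a hand-written parser, since PDB files contain many more record types and special cases than this sketch handles.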

Occupancy and B factor of an Atom
Occupancy: A typical crystal consists of a large number of unit cells (> 10^13), and the resulting model is therefore only an average over all these cells. Some atoms, especially those of large side chains (arginine, phenylalanine, ...), can be partially disordered; others can have several but fixed orientations. An occupancy lower than 1 indicates that an atom occupies this position in only a fraction of all unit cells. Most atoms, however, have an occupancy of 1.
B factor: Even though data are most often collected at 100 K, atoms are not immobile but vibrate (thermal motion). The temperature (or B) factor describes the vibration as a sphere within which the atom oscillates. At high resolution, when enough data are available, the oscillation can be described by an ellipsoid: the B-factor splits up into a (symmetric) 3x3 matrix that describes anisotropic thermal motion in three dimensions.
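
To get a feeling for the numbers, an isotropic B factor can be converted into a root-mean-square displacement. The sketch below assumes the standard relation B = 8π²⟨u²⟩, which is not stated on the slide, and uses the B factor of the first atom of the 1CLI excerpt as input.

```python
import math

def rms_displacement(b_factor):
    """Root-mean-square displacement (in Å) for an isotropic B factor (in Å^2),
    assuming B = 8 * pi^2 * <u^2>."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))

# B factor of the first atom (N of THR 5) in the 1CLI excerpt
u_rms = rms_displacement(20.99)
print(f"B = 20.99 Å^2 corresponds to an rms displacement of {u_rms:.2f} Å")
```

A B factor of about 20 Å^2 thus corresponds to a displacement of roughly half an ångström, which is a sizeable fraction of a bond length.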

Illustration of the B factor
[Figure panels: isotropic B factors; anisotropic B factors]

Occupancy: An Example of Multiple Conformations
Initially the model contained only one position for the tyrosine. But the electron density map suggests that in about half of the molecules in the crystal the side chain of the tyrosine points in a different direction. This can be modelled by setting the occupancies for both orientations of the side chain to 0.5.

Visualising a Model
[Figure panels: ball and stick; CPK (space filling); Cα trace (smooth); Cα trace (coloured by B-factor); ball-and-stick (coloured by B-factor); ribbons]

Data Reliability: The Data to Parameter Ratio

Reliability of Data: The Data to Parameter Ratio
No measurement is exact; each is only an approximation to the true value. It is therefore important to have enough data to support the deduced model.
In protein crystallography we want to determine at least the coordinates of every atom of the structure. If more data are available, we add the isotropic B-value, and at best we can even determine an anisotropic B-value.
Our data are the unique reflections, the number of which is determined by the resolution, the space group, and the unit cell dimensions.

    Res. [Å]   parameters per atom                  data/parameters
    3.0        x, y, z                              0.9:1
    2.3        x, y, z; B                           1.5:1
    1.8        x, y, z; B                           3.1:1
    1.5        x, y, z; B                           5.4:1
    1.5        x, y, z; U11 U12 U13 U22 U23 U33     2.4:1
    1.1        x, y, z; U11 U12 U13 U22 U23 U33     6.1:1
    0.8        x, y, z; U11 U12 U13 U22 U23 U33     16:1
    (G. Sheldrick)

At resolutions of about 1.8 Å and worse, these ratios would be much too low to allow building of a proper model. The effective number of data is increased by incorporating additional (bio)chemical information.
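
The ratios in the table follow a simple pattern: the number of unique reflections grows with the volume of the resolution sphere, i.e. with 1/d^3. The Python sketch below takes the 3.0 Å row as the only input (0.9 data per parameter with 3 parameters per atom) and scales it to the other rows. The function name and the choice of reference row are assumptions of this illustration, not part of the lecture.

```python
def data_to_parameter_ratio(resolution, n_parameters,
                            ref_resolution=3.0, ref_data_per_atom=0.9 * 3):
    """Estimate data/parameters by scaling the 3.0 Å table row with 1/d^3."""
    data_per_atom = ref_data_per_atom * (ref_resolution / resolution) ** 3
    return data_per_atom / n_parameters

rows = [(3.0, 3, "x,y,z"), (2.3, 4, "x,y,z; B"), (1.8, 4, "x,y,z; B"),
        (1.5, 4, "x,y,z; B"), (1.5, 9, "x,y,z; Uij"),
        (1.1, 9, "x,y,z; Uij"), (0.8, 9, "x,y,z; Uij")]

for d, n_par, label in rows:
    ratio = data_to_parameter_ratio(d, n_par)
    print(f"{d:3.1f} Å   {label:12s} {ratio:5.1f}:1")
```

The estimates agree with the table to within rounding, which shows how quickly the information content increases with resolution: halving d gives eight times as many reflections.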

An Example: Data to Parameter Ratio (Scenario)
Experiment 1: high resolution, 21 data points with errors
Experiment 2: low resolution, 3 data points with errors
[Plot: the 21 data points, the 3 data points, and the curve f(x) = x^2; x axis from -2 to 3]

An Example: Data to Parameter Ratio (Two Models)
Model 1: $g(x) = g_2 x^2 + g_1 x + g_0$
Model 2: $h(x) = h_3 x^3 + h_1 x + h_0$
Both models require three parameters.

An Example: Data to Parameter Ratio (Fitting High Resolution Data)
[Plot: the 21 data points with both fitted curves; x axis from -2 to 3]
Model 1: $1.19x^2 + 0.00x - 0.51$, $\chi^2 = 1.14$
Model 2: $0.16x^3 + 0.52x + 0.47$, $\chi^2 = 22.4$
Remarks:
- $\chi^2$ is a common error estimator from statistics. Normalised by the number of degrees of freedom, it should be close to 1 for a good model.
- $\chi^2$ makes a clear distinction between the two models.
- The reliability of $\chi^2$ depends on a good estimate of the errors of the data points.
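
The data points of the slide are not available, so the following Python sketch repeats the experiment with synthetic data: 21 points around f(x) = x^2 with an assumed Gaussian error `sigma = 0.5` and a fixed random seed. The numbers will therefore differ from those on the slide; only the qualitative outcome is meant to carry over. The helper name `fit` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                  # assumed error of each data point
x = np.linspace(-2.0, 3.0, 21)               # 21 points over the plotted range
y = x ** 2 + rng.normal(0.0, sigma, x.size)  # synthetic data around f(x) = x^2

def fit(design, y, sigma):
    """Linear least-squares fit; returns coefficients and reduced chi^2."""
    coeff, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeff
    dof = y.size - design.shape[1]           # number of data minus parameters
    return coeff, np.sum((residuals / sigma) ** 2) / dof

# Model 1: g2 x^2 + g1 x + g0      Model 2: h3 x^3 + h1 x + h0
g_coeff, chi2_g = fit(np.column_stack([x ** 2, x, np.ones_like(x)]), y, sigma)
h_coeff, chi2_h = fit(np.column_stack([x ** 3, x, np.ones_like(x)]), y, sigma)

print(f"Model 1: coefficients {np.round(g_coeff, 2)}, chi^2 = {chi2_g:.2f}")
print(f"Model 2: coefficients {np.round(h_coeff, 2)}, chi^2 = {chi2_h:.2f}")
```

With 21 data points and 3 parameters there are 18 degrees of freedom left, enough for χ² to separate the model that can describe the data from the one that cannot.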

An Example: Data to Parameter Ratio (Fitting Low Resolution Data)
[Plot: the 3 data points with both fitted curves; x axis from -2 to 3]
Model 1: $0.72x^2 + 0.00x + 1.17$
Model 2: $0.48x^3 - 2.66x + 2.62$
Remarks:
- Both models fit the data perfectly.
- There are no error estimates because the number of parameters equals the number of data.
- Additional knowledge is required to decide which model is correct.
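
A short Python sketch of this situation follows. The three data points are an assumption: they are reconstructed from the fitted curves on the slide, which coincide (to within rounding) at x = -2, 0.5 and 3, and the y values are taken from Model 1. With three parameters and three data points the fit reduces to solving a linear system.

```python
import numpy as np

# Three data points, reconstructed (an assumption) from Model 1 on the slide
x = np.array([-2.0, 0.5, 3.0])
y = 0.72 * x ** 2 + 1.17

# Three equations, three unknowns: both models pass through all points.
g = np.linalg.solve(np.column_stack([x ** 2, x, np.ones(3)]), y)  # g2, g1, g0
h = np.linalg.solve(np.column_stack([x ** 3, x, np.ones(3)]), y)  # h3, h1, h0

residual_g = y - (g[0] * x ** 2 + g[1] * x + g[2])
residual_h = y - (h[0] * x ** 3 + h[1] * x + h[2])

print("Model 1:", np.round(g, 2), " max |residual| =", np.abs(residual_g).max())
print("Model 2:", np.round(h, 2), " max |residual| =", np.abs(residual_h).max())
```

Both residuals vanish up to rounding, so the data alone cannot distinguish the two models, and with zero degrees of freedom no χ² can be computed.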

An Example: Data to Parameter Ratio (Fitting Low Resolution Data, Constrained)
Assumed constraint: the curve passes through (0, 0).
Model 1: $g(x) = g_2 x^2 + g_1 x$
Model 2: $h(x) = h_3 x^3 + h_1 x$
[Plot: the 3 data points with both fitted curves; x axis from -2 to 3]
Model 1: $0.94x^2 - 0.15x$, $\chi^2 = 1.30$
Model 2: $0.83x^3 - 5.35x$, $\chi^2 = 14.4$
$\chi^2$ favours Model 1. One constraint makes the difference between Model 1 and Model 2 a lot more striking than in the previous, unconstrained example.
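
The effect of the constraint can be sketched in Python as follows. The three data points and the uniform error `sigma = 1.0` are assumptions (the slide gives neither the points nor their individual errors), so the coefficients and χ² values only approximate those on the slide. The constraint is built into the models by dropping the constant term, which leaves two parameters and one degree of freedom.

```python
import numpy as np

# Assumed data points (reconstructed, not given on the slide) and assumed error
x = np.array([-2.0, 0.5, 3.0])
y = np.array([4.05, 1.35, 7.65])
sigma = 1.0

def constrained_fit(design):
    """Least-squares fit of a model without constant term; reduced chi^2."""
    coeff, *_ = np.linalg.lstsq(design, y, rcond=None)
    dof = y.size - design.shape[1]           # 3 data - 2 parameters = 1
    chi2 = np.sum(((y - design @ coeff) / sigma) ** 2) / dof
    return coeff, chi2

g_coeff, chi2_g = constrained_fit(np.column_stack([x ** 2, x]))  # g2, g1
h_coeff, chi2_h = constrained_fit(np.column_stack([x ** 3, x]))  # h3, h1

print(f"Model 1: {np.round(g_coeff, 2)}, chi^2 = {chi2_g:.2f}")
print(f"Model 2: {np.round(h_coeff, 2)}, chi^2 = {chi2_h:.2f}")
```

A single piece of prior knowledge is enough to turn an undecidable problem into one where the statistics clearly prefer one model. Restraints and constraints play the same role in crystallographic refinement.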

Model Building

Model Building: Getting Started
The first steps in building the model consist of finding larger groups of residues with special features. In proteins this is the (Cα) main chain, in nucleic acids the position of the bases. α helices are particularly easy to locate, even at medium to low resolution (2.5 to 4 Å).

Directionality of α Helices
From the main chain (Cα chain) alone one can determine neither the direction of the chain nor which part of the sequence it covers. One gets help from the so-called Christmas tree: the side chains of an α helix point towards the N-terminal end of the protein chain.
Selenomethionine-substituted proteins have become very popular for MAD experiments. The heavy selenium atoms are easy to find in the electron density map and help to dock the sequence to the map. Disulphide bridges or metals bound to an active centre can also be helpful.

β Strands
The other secondary structure element of proteins, the β strand, is also striking but more difficult to build. Especially the direction of the peptide chain can be difficult to find.

Automated Model Building
At a resolution better than, say, 2.5 Å, model building is greatly facilitated by programs like Arp/Warp (A. Perrakis, V. Lamzin) or Resolve (T. Terwilliger), which automatically build large parts of the structure.
These programs can even overcome local minima. Refinement programs (either least-squares or maximum likelihood) cannot cross such a barrier: in the example shown on the slide they would get stuck in the local minimum and could not move the phenylalanine into the right position.

Manual Model Building
Computer programs do not know about biology, and certainly not about a specific molecule or structure. Human interaction is therefore required to pay attention to:
- presence and identification of ligands and/or metal ions (from crystallisation or protein preparation)
- special interactions in complexes
- exceptions from the standard values used in refinement
- correct placement of solvent (water) molecules
Biochemical knowledge about these features adds valuable information to the model building process. This becomes especially important at medium or low resolution (2.5 Å and worse).

Hydrogen Atoms?
X-rays interact with the electron shell of atoms. The strength of the interaction is proportional to the total number of electrons. Hydrogen atoms have only one electron. They cannot be detected by X-ray diffraction (except with very high resolution data, about 1 Å).
During refinement, hydrogens are treated as riding atoms, that is, they are kept in a fixed position relative to the group they belong to (like the carbons of a phenylalanine ring). Compared with ignoring hydrogens completely, this method improves the quality of the model and also helps to keep the correct distances to neighbouring groups. Because of the fixed position, riding atoms do not increase the number of parameters.

Empty Space? The Solvent Region
[Figure panels: arrangement of molecules in the unit cell; electron density map]
The holes in both pictures are not vacuum. They are filled with solvent, i.e., mostly water molecules. These are disordered but still contribute to the diffraction pattern at low resolution.

The Solvent Model
Protein crystals are not very tightly packed. The space between the molecules is filled with solvent, which takes up 50 to 70% of the total volume on average. Because the solvent is disordered, it contributes mostly to the low-resolution reflections (d > 6 Å). Possible ways to treat the solvent are:
1. Ignore the solvent. This results in a high R-value: not liked by crystallographers and publishers.
2. Ignore data with d > 6 Å. This gives a better R-value but worse maps, which are difficult to interpret.
3. Consider the solvent region as a flat lake of electron density, i.e. with a low but constant average number of electrons.

Refinement

Excursus: Crystallographic Theory
Given the structure factors $F(hkl) = |F(hkl)| \exp(i\varphi(hkl))$, the electron density at position (x, y, z) is given by the Fourier transformation

$$\rho(x, y, z) = \frac{1}{V_{\text{unit cell}}} \sum_{h,k,l} |F(hkl)|\, e^{i\varphi(hkl)}\, e^{-2\pi i(hx+ky+lz)}$$

With the inverse Fourier transformation, the structure factors can be calculated from the electron density in the unit cell:

$$F(hkl) = \int_{V_{\text{unit cell}}} d^3x\; \rho(x, y, z)\, e^{2\pi i(hx+ky+lz)}$$

Once a model is known, the structure factors can be calculated from the spherical atomic scattering factors $f_j$ and the atomic positions $(x_j, y_j, z_j)$ by

$$F(hkl) = \sum_j f_j\, e^{2\pi i(hx_j+ky_j+lz_j)}$$

Finally, the spherical atomic scattering factors can be calculated from per-atom properties, but they are also tabulated (e.g. in the International Tables for Crystallography, Volume C).
Main message: With a model for a molecule and a unit cell, the (theoretical) structure factors F(hkl) can be calculated.
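
The main message can be made concrete with a few lines of Python. This is a minimal sketch under simplifying assumptions: the three-atom model is hypothetical, coordinates are fractional, and the scattering factors are replaced by constant electron counts (strictly valid only for the reflection (0, 0, 0)); B factors and symmetry are ignored. The function name `structure_factor` is made up for the illustration.

```python
import numpy as np

def structure_factor(hkl, frac_xyz, f):
    """F(hkl) = sum_j f_j exp(2 pi i (h x_j + k y_j + l z_j))."""
    phase = 2j * np.pi * (np.asarray(frac_xyz) @ np.asarray(hkl))
    return np.sum(np.asarray(f) * np.exp(phase))

# Hypothetical three-atom model in fractional coordinates; the scattering
# factors are approximated by the electron counts of C, N and O.
xyz = [[0.10, 0.20, 0.30], [0.40, 0.15, 0.70], [0.65, 0.80, 0.25]]
f = [6.0, 7.0, 8.0]

F = structure_factor((1, 2, 3), xyz, f)
print("amplitude:", abs(F), " phase (deg):", np.degrees(np.angle(F)))
```

Two quick sanity checks follow from the formula: F(000) equals the total number of electrons in the model, and F(-h,-k,-l) is the complex conjugate of F(hkl) as long as the scattering factors are real.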

Initial Map Generation
[Diagram: amplitudes |F(hkl)| and phases φ(hkl) are combined into the initial map, from which the initial model is built]
For the first map, phases were determined with MAD, or SIR, or Molecular Replacement, or ... These phases are generally of low quality, i.e., they have large errors.

The Vicious Circle of Refinement
[Diagram: cycle connecting the data (|F|), the phases φ from the model, map calculation, model building (new model, better w.r.t. the map!) and model refinement by program (checks chemical correctness)]
The new model was made using the map. The map was made using the previous model. Therefore, the new model is biased towards the old model: errors may persist.

Model Building and Refinement
Creating a model from X-ray data is an iterative process consisting of model building and refinement.
Refinement means global improvement of the model with respect to the experimental data. The coordinates of all atoms together with their temperature factors (and sometimes, at very high resolution, even the occupancies) are adjusted in order to minimise the difference between the measured intensities and the ones calculated from the model.
Model building means local improvement of the model with respect to the experimental data. Atoms are added, removed, or moved in order to ensure that
1. the model makes sense biochemically (proximity of atoms, H-bonding, position of solvent molecules, etc.)
2. the model fits the calculated electron density (e.g. check for multiple conformations)

Local Minima and Traps
Refinement can only find the next minimum of its target function.
[Figure: target function with several minima, labelled bad model, good model and best model; red crosses mark starting points]
Depending on the starting point (red crosses), this might result in a good or a bad model.
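
The dependence on the starting point can be illustrated with a toy target function. The Python sketch below uses a hypothetical one-dimensional function with two minima and plain steepest descent; neither is what a refinement program actually uses, they only stand in for "a minimiser that always walks downhill".

```python
def target(x):
    """Toy target function with a deep and a shallow minimum (hypothetical)."""
    return x ** 4 - 2 * x ** 2 + 0.5 * x

def gradient(x):
    return 4 * x ** 3 - 4 * x + 0.5

def refine(x, step=0.01, n_steps=5000):
    """Steepest descent: always walk downhill from the starting point."""
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x

x_good = refine(-0.5)   # starts on the slope of the deeper minimum
x_bad = refine(0.5)     # starts on the slope of the shallower minimum

print(f"start -0.5 -> x = {x_good:.3f}, target = {target(x_good):.3f}")
print(f"start +0.5 -> x = {x_bad:.3f}, target = {target(x_bad):.3f}")
```

Both runs converge, but only one reaches the better minimum. Getting from the shallow minimum to the deep one requires a step uphill, which is what manual rebuilding or automated building programs provide.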

Quality Figures: the R value
One measure to distinguish a good model from a bad one is the R value. It describes the agreement between the measured amplitudes $|F_{obs}(hkl)|$ and those calculated from the model, $|F_{calc}(hkl)|$:

$$R = \frac{\sum_{hkl} \big|\, |F_{obs}| - |F_{calc}| \,\big|}{\sum_{hkl} |F_{obs}|}$$

The $|F_{obs}|$ are given by the reflection data (observations); the $|F_{calc}|$ are calculated from the coordinates (x, y, z) and B-values of the atoms of the model.
For small molecules, R values between 2% and 5% are normal; for macromolecules, the range is approximately 20% to 30%. As a rule of thumb one can expect an R value of about 1/10 of the resolution in Å: a 2.5 Å structure should have an R value of about 0.25, i.e. 25%.
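
The formula translates directly into code. In the following Python sketch the function name `r_value` and the three amplitudes are made up for the example, and any scaling between observed and calculated amplitudes is assumed to have been applied beforehand.

```python
import numpy as np

def r_value(f_obs, f_calc):
    """R = sum | |F_obs| - |F_calc| | / sum |F_obs| over all reflections."""
    f_obs = np.abs(np.asarray(f_obs))
    f_calc = np.abs(np.asarray(f_calc))
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

# Made-up amplitudes for three reflections
r = r_value([10.0, 20.0, 30.0], [9.0, 22.0, 30.0])
print(f"R = {r:.3f}")
```

The differences are 1, 2 and 0, the observed amplitudes sum to 60, so R is 3/60 = 0.05 or 5%.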

Refinement and Overfitting
For macromolecules, the data to parameter ratio is not very high in the normal resolution range. Therefore, the R value can be reduced nearly arbitrarily by adding more and more atoms that were not really present in the crystal structure, or by allowing positions that do not make much sense chemically (stereochemical clashes). This is called overfitting of the data.
One measure to reduce overfitting is the R free value. About 5% to 10% of the reflections are excluded from the minimisation of the R value. They remain unconsidered and act like an independent judge: after refinement, the R free value is calculated like the R value, but with the excluded reflections only. The two values must not differ too much.
More importantly, refinement has to take restraints and constraints into account.
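
The bookkeeping behind R free can be sketched as follows. All amplitudes are random made-up numbers, 5% of the reflections are flagged as the free set, and no refinement takes place, so this only shows how the two sets are separated and evaluated. Names such as `free` and `r_value` are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_refl = 1000
f_obs = rng.uniform(10.0, 100.0, n_refl)               # made-up observations
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.2, n_refl))  # made-up model values

# Flag a random 5% of the reflections as the free set, once and for all.
free = np.zeros(n_refl, dtype=bool)
free[rng.choice(n_refl, size=n_refl // 20, replace=False)] = True

def r_value(obs, calc):
    return np.sum(np.abs(np.abs(obs) - np.abs(calc))) / np.sum(np.abs(obs))

r_work = r_value(f_obs[~free], f_calc[~free])   # set used in refinement
r_free = r_value(f_obs[free], f_calc[free])     # set never used in refinement
print(f"R work = {r_work:.3f}, R free = {r_free:.3f}")
```

In this toy case both values are similar because nothing was fitted. In a real refinement the working set is fitted and the free set is not, so R free is usually somewhat higher than R, and a growing gap between the two is a warning sign of overfitting.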

Restraints and Constraints
The reflection data alone would not be sufficient to create a trustworthy model. There are too many parameters. Therefore it is necessary to incorporate additional information. This is done by using restraints and constraints.
Small molecules at high resolution can be refined unrestrained. Macromolecules are almost always refined by restrained refinement, i.e. additional information like ideal bond lengths and angles is taken into account.
Constraints reduce the number of parameters. These are expressions like "property X must have this value", e.g. "the temperature factor is isotropic": 4 parameters per atom instead of 9 parameters per atom.
Restraints increase the number of data. These are "should be" or "should be approximately" expressions, e.g. distance(N, Cα) = 1.458 Å ± 0.019 Å.
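
How a restraint enters as an additional observation can be sketched with the N and Cα atoms of THR 5 from the 1CLI excerpt and the ideal distance quoted above. The squared, error-weighted deviation used here is a common form of a restraint term; treating it as the exact term of any particular refinement program would be an assumption.

```python
import math

# N and CA of THR A 5 from the 1CLI excerpt (coordinates in Å)
n_atom = (15.163, 80.897, 61.279)
ca_atom = (15.093, 82.326, 61.723)

ideal, sigma = 1.458, 0.019           # restraint: distance(N, CA) in Å

distance = math.dist(n_atom, ca_atom)  # requires Python 3.8 or later
deviation = (distance - ideal) / sigma # deviation in units of sigma
penalty = deviation ** 2               # contribution to the target function

print(f"distance = {distance:.3f} Å, deviation = {deviation:.1f} sigma, "
      f"penalty = {penalty:.1f}")
```

The bond in the deposited model is about 0.04 Å longer than the ideal value, roughly two standard deviations. Each such term acts like one more data point and thereby improves the data to parameter ratio.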

Structure Validation

Why Validation?
Scientific (experimental) results are always afflicted with prejudice and bias, be it deliberately or by accident and ignorance. Even though articles are reviewed by referees, the experiment itself will hardly ever be repeated by an independent person before publication.
Protein crystallography is no exception. Moreover, crystallographic results are most often presented as colourful pictures that can easily lead the reader to over-interpret their meaning. Since such models are used by non-crystallographers, it is important for them to be able to check their quality.

Caveat: The Two Faces of Photoactive Yellow Protein
[Figure: two models of the protein, labelled 1989 and 1995]
The first model was published in 1989 (PDB entry 1phy). The correct version was published six years later (PDB entry 2phy).
Kleywegt, Acta D (2000), D56

Caveat: Modelling Models
The structure of the TATA-box binding protein (TBP or TFIIDτ) was published in 1992 (Nikolov et al., Nature 360, pp. 40-46). The shape of the molecule suggested that the TATA box sits straight in the groove of the protein.
The structure of the complex, published a year later by Kim et al. (Nature 365, pp. 520-527), revealed that the DNA was actually heavily bent.

Caveat: What You See Is What You Get?
Another issue with PDB files is that they contain more information than a graphical viewer might be able to display.
Many crystallographers include atoms or residues in their structures without experimental support and set their occupancy to zero. This could be justified because they know the residues were present in the molecule (at least for recombinant proteins).

Means for Structure Validation
Validation means assessing the model in comparison with the data. However, since the model was created by refinement against the data, the model is biased. Therefore, there is need for an independent judge. All information can be used
1. that did not participate in the creation of the model or in the minimisation of the difference between model and data, and
2. of which ideal values are known. This means that this information must be the same or similar for all proteins.

Validation: Model vs. Data
Data collected from the crystal are of course the first source one would think of when it comes to validation. Unfortunately, in calculating the electron density, amplitudes from the data were mixed with phases from the model. This means that our model is already heavily biased.
This is why 5% to 10% of all reflections are never used for refinement: they allow the calculation of the R free value,

$$R_{free} = \frac{\sum_{hkl \in \text{free}} \big|\, |F_{obs}| - |F_{calc}| \,\big|}{\sum_{hkl \in \text{free}} |F_{obs}|}$$

For example, 95% of all reflections (h,k,l) are used to calculate the R value, which is used for model refinement and optimisation. The remaining 5% of the reflections are NOT used for refinement or optimisation, and the R free value is calculated from them with the same formula. Therefore, these 5% of the reflections are independent of the model.

Validation: The Real Space R Factor
R and R free are calculated from reflection data, i.e. in reciprocal space. These two numbers are global figures of merit: one number tries to describe the quality of the total structure.
A more local figure of merit is the real space R factor or real space correlation coefficient. It expresses the fit between the electron density calculated from the measured amplitudes with model phases and the density calculated from the model alone (amplitudes and phases).
The electron density around a residue does not depend much on other residues. The resulting figures are therefore local quality indicators.

The Real Space R Factor: an Example

Validation: Dihedral Angles and the Ramachandran Plot
Quantities that were not used in refinement, and are therefore mostly unbiased, are certain angles. The most famous ones are the dihedral angles φ and ψ, defined by the peptide main chain.
φ is the angle between the two planes defined by C(i-1), N(i), Cα(i) and by N(i), Cα(i), C(i), whereas ψ is the angle between the two planes defined by N(i), Cα(i), C(i) and by Cα(i), C(i), N(i+1).
For energetic reasons, these two angles are not independent. Their dependency is shown in the Ramachandran plot.
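
A dihedral angle is easily computed from the coordinates of the four atoms that define the two planes. The Python sketch below uses a common atan2-based formula; the function name `dihedral` and the test points are made up for the illustration. For φ of residue i the four points would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1).

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) between the planes p0-p1-p2 and p1-p2-p3."""
    b1 = np.subtract(p1, p0)
    b2 = np.subtract(p2, p1)
    b3 = np.subtract(p3, p2)
    n1 = np.cross(b1, b2)          # normal of the first plane
    n2 = np.cross(b2, b3)          # normal of the second plane
    y = np.linalg.norm(b2) * np.dot(b1, n2)
    x = np.dot(n1, n2)
    return float(np.degrees(np.arctan2(y, x)))

# Artificial test geometry: the last point is rotated about the central bond.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
right_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, right_angle, trans)
```

Applied to every residue of a model, the resulting (φ, ψ) pairs are the points of the Ramachandran plot.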

Validation: The Ramachandran Plot
The Ramachandran plot shows the φ vs. ψ angles for a structure and the most probable regions derived from the 500 best determined protein structures.
[Figure: interactive Ramachandran window of the model building program Coot, with the regions labelled β strand, α helix and left handed α helix]

Validation: Ramachandran with Several Molecules
Even more information can be read from the Ramachandran plot if more than one copy of a molecule lives in the asymmetric unit: the two (or more) copies should be rather similar to each other. If one draws the Ramachandran plots for all molecules into the same diagram and connects corresponding residues, one should NOT obtain a picture like the one shown on the slide.
[Figure: Ramachandran plot with corresponding residues of several copies connected]
Kleywegt, Acta D (2000), D56

Validation: Summary
Most of the pretty pictures of proteins represent structures determined by X-ray diffraction. But do not be deceived by colours and artistic compositions. Everyone who makes use of PDB files or structural data should be aware of possible pitfalls.
1. Read the header information.
2. Consider the resolution and the data quality.
3. Do the quality and the resolution allow for the details you want to extract?
4. Make use of programs that examine the structure and (if available/possible) the data.
Interpretation of data is important for science, but one must not exaggerate and must stay close to the facts.